[issue16311] Use _PyUnicodeWriter API in text decoders

2012-11-07 Thread STINNER Victor

STINNER Victor added the comment:

Oh, I forgot my benchmark results.

decodebench.py results on Linux 32 bits:
(Linux-3.2.0-32-generic-pae-i686-with-debian-wheezy-sid)

$ ./python bench-diff.py original writer
codec     input         original  change  writer
ascii     'A'*1         4109      (-3%)   3974

latin1    'A'*1         3851      (-5%)   3644
latin1    '\x80'*1      14832     (-3%)   14430

utf-8     'A'*1         3747      (-4%)   3608
utf-8     '\x80'*1      976       (-2%)   961
utf-8     '\u0100'*1    974       (-2%)   959
utf-8     '\u8000'*1    804       (-14%)  694
utf-8     '\U0001'*1    666       (-5%)   635

utf-16le  'A'*1         4154      (-1%)   4117
utf-16le  '\x80'*1      4055      (-2%)   3988
utf-16le  '\u0100'*1    4047      (-2%)   3974
utf-16le  '\u8000'*1    917       (-1%)   912
utf-16le  '\U0001'*1    872       (-0%)   870

utf-16be  'A'*1         3218      (-1%)   3185
utf-16be  '\x80'*1      3163      (-2%)   3114
utf-16be  '\u0100'*1    2591      (-1%)   2556
utf-16be  '\u8000'*1    979       (-1%)   974
utf-16be  '\U0001'*1    928       (-0%)   925

utf-32le  'A'*1         1681      (+12%)  1885
utf-32le  '\x80'*1      1697      (+10%)  1865
utf-32le  '\u0100'*1    2224      (+1%)   2254
utf-32le  '\u8000'*1    2224      (+2%)   2269
utf-32le  '\U0001'*1    2234      (+1%)   2260

utf-32be  'A'*1         1685      (+11%)  1868
utf-32be  '\x80'*1      1684      (+10%)  1860
utf-32be  '\u0100'*1    2223      (+1%)   2253
utf-32be  '\u8000'*1              (+1%)   2255
utf-32be  '\U0001'*1    2243      (+1%)   2257

decodebench.py results on Linux 64 bits:
(Linux-3.4.9-2.fc16.x86_64-x86_64-with-fedora-16-Verne)

codec     input         original  change  writer
ascii     'A'*1         10043     (+1%)   10144

latin1    'A'*1         8351      (-1%)   8258
latin1    '\x80'*1      19184     (+2%)   19560

utf-8     'A'*1         8083      (+5%)   8461
utf-8     '\x80'*1      982       (+1%)   993
utf-8     '\u0100'*1    984       (+1%)   992
utf-8     '\u8000'*1    806       (+31%)  1053
utf-8     '\U0001'*1    639       (+12%)  718

utf-16le  'A'*1         5547      (-2%)   5422
utf-16le  '\x80'*1      5205      (+1%)   5271
utf-16le  '\u0100'*1    4900      (-4%)   4695
utf-16le  '\u8000'*1    1062      (+9%)   1154
utf-16le  '\U0001'*1    1040      (+4%)   1078

utf-16be  'A'*1         5416      (-5%)   5157
utf-16be  '\x80'*1      5077      (-1%)   5011
utf-16be  '\u0100'*1    4261      (-1%)   4218
utf-16be  '\u8000'*1    1146      (+0%)   1147
utf-16be  '\U0001'*1    1125      (-1%)   1119

utf-32le  'A'*1         1743      (+8%)   1880
utf-32le  '\x80'*1      1751      (+5%)   1842
utf-32le  '\u0100'*1    2114      (+29%)  2721
utf-32le  '\u8000'*1    2120      (+28%)  2718
utf-32le  '\U0001'*1    2065      (+30%)  2690

utf-32be  'A'*1         1761      (+6%)   1860
utf-32be  '\x80'*1      1749      (+6%)   1856
utf-32be  '\u0100'*1    2101      (+29%)  2715
utf-32be  '\u8000'*1    2083      (+30%)  2715
utf-32be  '\U0001'*1    2058      (+31%)  2689

Most significant changes:
 * -14% to decode '\u8000'*1 from UTF-8 on Linux 32 bits
 * +31% to decode '\u8000'*1 from UTF-8 on Linux 64 bits
 * +28% to +31% to decode UCS-2 and UCS-4 characters from UTF-32 on Linux 64 bits
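
For reference, the percentages in the result tables are consistent with a
relative change from the first number (original) to the second (writer).
The sketch below assumes that is how bench-diff.py computes them; the script
itself is not shown in this thread, and percent_change is a hypothetical
helper name.

```python
def percent_change(original, writer):
    """Relative change from `original` to `writer`, rounded to a whole percent."""
    return round((writer - original) / original * 100)


# Values taken from the UTF-8 '\u8000' rows of the two result tables.
print(percent_change(804, 694))    # Linux 32 bits -> -14
print(percent_change(806, 1053))   # Linux 64 bits -> 31
```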

@Serhiy Storchaka: If you feel able to tune _PyUnicodeWriter to
improve its performance, please open a new issue.

I consider the performance changes acceptable and I don't plan to work
on this topic.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16311
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16311] Use _PyUnicodeWriter API in text decoders

2012-11-06 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 7ed9993d53b4 by Victor Stinner in branch 'default':
Close #16311: Use the _PyUnicodeWriter API in text decoders
http://hg.python.org/cpython/rev/7ed9993d53b4

--
nosy: +python-dev
resolution:  -> fixed
stage:  -> committed/rejected
status: open -> closed




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-31 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I updated the patch to resolve the conflict with issue14625.

--
Added file: http://bugs.python.org/file27806/codecs_writer_2.patch




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-31 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file27807/codecs_writer_2.patch




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-31 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Removed file: http://bugs.python.org/file27806/codecs_writer_2.patch




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-31 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file27808/decodebench.res




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-31 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

With the patch, the UTF-8 decoder is 20% slower for some data. The UTF-16
decoder is 20% faster for some data and 20% slower for other data. The UTF-32
decoder is slower for much of the data (even after some optimization; the
naive code was up to 50% slower). The standard charmap decoder is 10% slower.
Only the UTF-7, unicode-escape and raw-unicode-escape decoders have become
much faster (unicode-escape and raw-unicode-escape as with the issue16334
patch).

Well-optimized decoders do not benefit from _PyUnicodeWriter; they only see a
slight slowdown. The patch requires some optimization (as for the UTF-32
decoder) to reduce the negative effect. Non-optimized decoders will benefit
greatly.

--




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-31 Thread STINNER Victor

STINNER Victor added the comment:

I ran the decodebench.py and bench-diff.py scripts from #14624; I just
replaced repeat=10 with repeat=100 to get more reliable numbers. I
only see some performance regressions between -5% and -1%, but there
are some speedups on UTF-8 and UTF-32 (between +11% and +14%). On a
microbenchmark, numbers in the -10..10% range just mean no change.

Using _PyUnicodeWriter should not change performance on valid data,
only the performance of handling decoding errors, because the
overallocation factor is different, as is the code to widen the buffer
and the code to write replacement characters.
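
The kind of measurement discussed here can be approximated with the standard
timeit module. This is only a rough sketch, not decodebench.py: bench_decode
is a hypothetical helper, and the input size and repeat counts are arbitrary
choices. It times a valid input and an input where every byte triggers the
error handler, since the latter is the path where a difference is expected.

```python
import timeit


def bench_decode(data, encoding, errors="strict", repeat=100, number=10):
    """Return the best observed time, in seconds, for one decode call."""
    timings = timeit.repeat(
        lambda: data.decode(encoding, errors),
        repeat=repeat,   # more repetitions give a more stable minimum
        number=number,   # decode calls per repetition
    )
    return min(timings) / number


if __name__ == "__main__":
    valid = ("\u8000" * 1000).encode("utf-8")
    invalid = b"\xff" * 1000  # every byte is a decoding error in UTF-8
    print("valid:  ", bench_decode(valid, "utf-8"))
    print("invalid:", bench_decode(invalid, "utf-8", errors="replace"))
```

Taking the minimum over many repetitions reduces noise from other processes,
but differences of a few percent should still be read with caution.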

--




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-30 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I will do some experiments and review tomorrow.

--




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-29 Thread STINNER Victor

STINNER Victor added the comment:

> Soon I'll post a patch, which speeds up unicode-escape and raw-unicode-escape
> decoders by 1.5-3x. Also, there are not-yet-reviewed patches for UTF-32
> (issue14625) and charmap (issue14850) decoders. There will be merge conflicts.

codecs_writer.patch doesn't change the core of the decoders much; it mostly
changes the code before and after the loop, and the error handling. You can
still use PyUnicode_WRITE, PyUnicode_READ, memcpy(), etc.

> But I will review the patch.

If you review the patch, please check how the buffer is allocated. It
should not be overallocated by default, only on the first error. Overallocation
can kill performance when it is not necessary (especially on Windows).

--




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-24 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@gmail.com:


--
nosy: +loewis, serhiy.storchaka




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-24 Thread STINNER Victor

New submission from STINNER Victor:

The attached patch modifies the text decoders to use the _PyUnicodeWriter API
to factorize the code. It removes the unicode_widen() and unicode_putchar()
functions.

 * Don't overallocate by default (except for the raw-unicode-escape codec);
   enable overallocation on the first decode error (as done currently)
 * _PyUnicodeWriter_Prepare() only overallocates by 25%, instead of 100%
   for unicode_decode_call_errorhandler()
 * Use _PyUnicodeWriter_Prepare() + PyUnicode_WRITE() (two macros)
   instead of unicode_putchar() (a function)
 * The _PyUnicodeWriter structure stores many useful fields, so we don't
   have to pass multiple parameters to functions, only the writer

I wrote the patch to factorize the code, but it might also be faster.
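
The allocation policy described in the list can be illustrated with a toy
model in Python. This is not the C implementation: ToyWriter,
toy_decode_ascii and the "<?>" replacement string are all hypothetical, and
only the two ideas from the list are modeled (exact-size buffer by default,
25% overallocation once the first decode error is seen).

```python
class ToyWriter:
    """Toy model of a growable output buffer (not the real C structure)."""

    def __init__(self, size):
        self.buffer = [None] * size  # exact size: no overallocation by default
        self.pos = 0
        self.overallocate = False
        self.reallocations = 0

    def prepare(self, extra):
        """Ensure room for `extra` more items, growing the buffer if needed."""
        needed = self.pos + extra
        if needed <= len(self.buffer):
            return
        if self.overallocate:
            needed += needed // 4  # 25% headroom, as in the description
        self.buffer.extend([None] * (needed - len(self.buffer)))
        self.reallocations += 1

    def write(self, ch):
        self.prepare(1)
        self.buffer[self.pos] = ch
        self.pos += 1

    def finish(self):
        return "".join(self.buffer[:self.pos])


def toy_decode_ascii(data, replacement="<?>"):
    """Decode bytes as ASCII, writing `replacement` for each invalid byte."""
    writer = ToyWriter(len(data))
    for byte in data:
        if byte < 0x80:
            writer.write(chr(byte))
        else:
            writer.overallocate = True  # enabled on the first decode error
            for ch in replacement:
                writer.write(ch)
    return writer.finish(), writer.reallocations


print(toy_decode_ascii(b"abc"))      # valid input: the buffer never grows
print(toy_decode_ascii(b"a\xffb"))   # one error: the buffer grows once
```

On valid input the buffer is allocated once at its final size, which is why
the thread expects differences mainly on the error-handling path.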

--
files: codecs_writer.patch
keywords: patch
messages: 173695
nosy: haypo
priority: normal
severity: normal
status: open
title: Use _PyUnicodeWriter API in text decoders
type: performance
versions: Python 3.4
Added file: http://bugs.python.org/file27697/codecs_writer.patch




[issue16311] Use _PyUnicodeWriter API in text decoders

2012-10-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Soon I'll post a patch, which speeds up unicode-escape and raw-unicode-escape
decoders by 1.5-3x. Also, there are not-yet-reviewed patches for UTF-32
(issue14625) and charmap (issue14850) decoders. There will be merge conflicts.

But I will review the patch.

--
