[issue14923] Even faster UTF-8 decoding

2012-06-23 Thread Serhiy Storchaka

Serhiy Storchaka storch...@gmail.com added the comment:

Any chance to commit the patch before final feature freeze?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14923
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14923] Even faster UTF-8 decoding

2012-06-23 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> Any chance to commit the patch before final feature freeze?

I'll defer to Mark :-)

--




[issue14923] Even faster UTF-8 decoding

2012-06-23 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Okay, will look at this this afternoon.

--
assignee:  -> mark.dickinson




[issue14923] Even faster UTF-8 decoding

2012-06-23 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

I'm happy to apply the 'decode_utf8_range_check.patch';  I'll do that unless 
there are objections.  The code is clearer than the original, and if we get a 
speedup into the bargain then I don't see a reason not to apply this.

I'm less comfortable with either the original patch, or the most recent one 
(decode_utf8_signed_byte-2.patch).

--




[issue14923] Even faster UTF-8 decoding

2012-06-23 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Serhiy, does this patch also fix #8271?
If so, can you also include the tests I wrote for it?

--




[issue14923] Even faster UTF-8 decoding

2012-06-23 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

Patch applied.  Closing.

Ezio:  the patch is pure optimization, with no change in semantics;  I don't 
see how it could fix #8271.

--
resolution:  -> fixed
status: open -> closed




[issue14923] Even faster UTF-8 decoding

2012-06-23 Thread Roundup Robot

Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset 3214c9ebcf5e by Mark Dickinson in branch 'default':
Issue #14923: Optimize continuation-byte check in UTF-8 decoding.  Patch by 
Serhiy Storchaka.
http://hg.python.org/cpython/rev/3214c9ebcf5e

--




[issue14923] Even faster UTF-8 decoding

2012-06-23 Thread Serhiy Storchaka

Serhiy Storchaka storch...@gmail.com added the comment:

> Serhiy, does this patch also fix #8271?

No, this patch does not change behavior. But the updated patch for issue 8271
now contains this patch (I hope this will help it get merged).

> If so, can you also include the tests I wrote for it?

Your tests are included in the patch for issue 8271.

--




[issue14923] Even faster UTF-8 decoding

2012-06-22 Thread Serhiy Storchaka

Serhiy Storchaka storch...@gmail.com added the comment:

Here is a patch that uses some sort of autodetection.

--
Added file: http://bugs.python.org/file26098/decode_utf8_signed_byte-2.patch




[issue14923] Even faster UTF-8 decoding

2012-05-28 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

> It seems the patch relies on a two's complement representation of
> integers. Mark, do you think that's ok?

(1) Relying on two's complement integers seems fine to me:  we're already 
relying on it in other places in Python (e.g., bitwise operations for ints in 
Python 2.x); it seems unlikely Python's going to run into current or future 
hardware that uses anything else; and any special-case code for ones' 
complement or sign-magnitude is going to be essentially unused and awkward to 
test, so it's probably better not to have it in the codebase at all.

In an ideal world, I guess we'd add some configure-time tests for two's 
complement so that in the unlikely event that Python *does* meet a non two's 
complement platform the build fails early with a clear message rather than the 
tests failing in strange ways.

(2) The bit that Martin identifies: relying on conversion from unsigned to 
signed doing a wraparound modulo 2**n for a suitable n is a bit more troubling;  it's 
something that I try to avoid where possible, but that's not always easy.  I 
don't recall ever having encountered this causing problems in practice, 
though---it feels like a leftover from a non-two's complement world where 
processors would have a hard time giving wraparound semantics, so the C 
standard didn't want to require it.  Again, if we're going to rely on this, it 
would probably make sense to have some configure-time checks; and it would be 
better not to rely on it at all without a really good reason.
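
As a rough illustration of both points, the following Python sketch models how one 8-bit pattern reads under the three signed representations that C has historically allowed, and checks that "wraparound modulo 2**8" coincides with the two's complement reading. The helper names and the threshold of -64 are assumptions made for illustration only; they are not taken from any of the patches in this issue.

```python
def twos_complement(bits: int) -> int:
    """Value of an 8-bit pattern read as two's complement."""
    return bits - 256 if bits & 0x80 else bits


def ones_complement(bits: int) -> int:
    """Value of an 8-bit pattern read as ones' complement."""
    return -(~bits & 0x7F) if bits & 0x80 else bits


def sign_magnitude(bits: int) -> int:
    """Value of an 8-bit pattern read as sign-magnitude."""
    return -(bits & 0x7F) if bits & 0x80 else bits


def wrap_to_signed(value: int) -> int:
    """Unsigned-to-signed conversion that wraps modulo 2**8."""
    return (value + 128) % 256 - 128


# Wraparound conversion and the two's complement reading agree for every byte.
wrap_matches = all(wrap_to_signed(b) == twos_complement(b) for b in range(256))

# Hypothetical check "signed value < -64" evaluated under each representation.
selected = {
    name: [b for b in range(256) if read(b) < -64]
    for name, read in [
        ("two's complement", twos_complement),
        ("ones' complement", ones_complement),
        ("sign-magnitude", sign_magnitude),
    ]
}

for name, values in selected.items():
    print(f"{name}: 0x{values[0]:02X}..0x{values[-1]:02X} ({len(values)} bytes)")
```

Under these assumptions only the two's complement reading selects exactly the continuation bytes 0x80..0xBF, which is one way to see why special-case code for the other representations would be needed if such platforms ever had to be supported.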

--




[issue14923] Even faster UTF-8 decoding

2012-05-27 Thread Serhiy Storchaka

Serhiy Storchaka storch...@gmail.com added the comment:

Yes, this is implementation-defined behavior (and on the supported 
platforms it is implemented in a well-defined way).

However, if the continuation-byte check is done in the simplest way, 
((ch) >= 0x80 && (ch) < 0xC0), it has the same effect (a speedup of up to 
+45%) on AMD Athlon.

                         vanilla        patched

utf-8  'A'*1                2061  (-2%)    2018
utf-8  '\x80'*1              383  (+9%)     416
utf-8  '\x80'+'A'*          1273  (+3%)    1315
utf-8  '\u0100'*1            382  (+46%)    558
utf-8  '\u0100'+'A'*        1239  (+0%)    1245
utf-8  '\u0100'+'\x80'*      383  (+46%)    558
utf-8  '\u8000'*1            434  (-6%)     408
utf-8  '\u8000'+'A'*        1245  (+0%)    1245
utf-8  '\u8000'+'\x80'*      382  (+46%)    556
utf-8  '\u8000'+'\u0100'*    383  (+45%)    556
utf-8  '\U0001'*1            358  (+0%)     359
utf-8  '\U0001'+'A'*        1171  (-0%)    1170
utf-8  '\U0001'+'\x80'*      381  (+30%)    495
utf-8  '\U0001'+'\u0100'*    381  (+30%)    495
utf-8  '\U0001'+'\u8000'*    404  (-5%)     385

On Intel Atom the results either did not change or became a little better.

                         vanilla        patched

utf-8  'A'*1                 623  (+3%)     642
utf-8  '\x80'*1              145  (+9%)     158
utf-8  '\x80'+'A'*           354  (+4%)     367
utf-8  '\u0100'*1            164  (+0%)     164
utf-8  '\u0100'+'A'*         343  (+2%)     351
utf-8  '\u0100'+'\x80'*      164  (+1%)     165
utf-8  '\u8000'*1            175  (-2%)     171
utf-8  '\u8000'+'A'*         349  (+3%)     359
utf-8  '\u8000'+'\x80'*      164  (+0%)     164
utf-8  '\u8000'+'\u0100'*    164  (+0%)     164
utf-8  '\U0001'*1            152  (-1%)     150
utf-8  '\U0001'+'A'*         313  (+2%)     319
utf-8  '\U0001'+'\x80'*      161  (+1%)     162
utf-8  '\U0001'+'\u0100'*    161  (+1%)     162
utf-8  '\U0001'+'\u8000'*    160  (-2%)     156
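
To make the "simplest way" concrete, here is a small Python sketch comparing a plain range test on a byte with the conventional bit-mask test for a UTF-8 continuation byte (top two bits equal to 10). The thread does not show the macro being replaced, so the mask form and both function names are assumptions for illustration.

```python
def is_continuation_mask(ch: int) -> bool:
    """Conventional test (assumed): the top two bits must be 10."""
    return (ch & 0xC0) == 0x80


def is_continuation_range(ch: int) -> bool:
    """Range test: 0x80 <= ch < 0xC0."""
    return 0x80 <= ch < 0xC0


# Exhaustively compare the two predicates over every possible byte value.
mismatches = [ch for ch in range(256)
              if is_continuation_mask(ch) != is_continuation_range(ch)]
continuation_bytes = [ch for ch in range(256) if is_continuation_range(ch)]

print(mismatches)               # bytes on which the two tests disagree
print(len(continuation_bytes))  # number of continuation byte values
```

Because the two predicates agree on all 256 byte values, swapping one for the other cannot change decoding results; any difference is purely in the code the compiler generates.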

--
Added file: http://bugs.python.org/file25733/decode_utf8_range_check.patch




[issue14923] Even faster UTF-8 decoding

2012-05-27 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> However, if the continuation-byte check is done in the simplest way,
> ((ch) >= 0x80 && (ch) < 0xC0), it has the same effect (a speedup of up to
> +45%) on AMD Athlon.

Doesn't produce any significant speedup on Intel Core i5-2500.

--




[issue14923] Even faster UTF-8 decoding

2012-05-26 Thread Serhiy Storchaka

New submission from Serhiy Storchaka storch...@gmail.com:

Strange as it may seem, a simple trick makes UTF-8 decoding even faster.

Here are the benchmark results.

On 32-bit Linux, AMD Athlon 64 X2:

                         vanilla        patched

utf-8  'A'*1                2061  (+3%)    2115
utf-8  '\x80'*1              383  (-7%)     355
utf-8  '\x80'+'A'*          1273  (+1%)    1290
utf-8  '\u0100'*1            382  (+47%)    562
utf-8  '\u0100'+'A'*        1239  (+1%)    1253
utf-8  '\u0100'+'\x80'*      383  (+47%)    562
utf-8  '\u8000'*1            434  (-6%)     409
utf-8  '\u8000'+'A'*        1245  (+1%)    1256
utf-8  '\u8000'+'\x80'*      382  (+47%)    560
utf-8  '\u8000'+'\u0100'*    383  (+44%)    553
utf-8  '\U0001'*1            358  (+4%)     373
utf-8  '\U0001'+'A'*        1171  (+0%)    1176
utf-8  '\U0001'+'\x80'*      381  (+44%)    548
utf-8  '\U0001'+'\u0100'*    381  (+44%)    548
utf-8  '\U0001'+'\u8000'*    404  (+0%)     406

On 32-bit Linux, Intel Atom N570:

                         vanilla        patched

utf-8  'A'*1                 623  (+0%)     626
utf-8  '\x80'*1              145  (+15%)    167
utf-8  '\x80'+'A'*           354  (+2%)     362
utf-8  '\u0100'*1            164  (+10%)    181
utf-8  '\u0100'+'A'*         343  (-0%)     342
utf-8  '\u0100'+'\x80'*      164  (+11%)    182
utf-8  '\u8000'*1            175  (+5%)     183
utf-8  '\u8000'+'A'*         349  (+0%)     349
utf-8  '\u8000'+'\x80'*      164  (+11%)    182
utf-8  '\u8000'+'\u0100'*    164  (+10%)    181
utf-8  '\U0001'*1            152  (+11%)    168
utf-8  '\U0001'+'A'*         313  (+0%)     313
utf-8  '\U0001'+'\x80'*      161  (+11%)    179
utf-8  '\U0001'+'\u0100'*    161  (+11%)    179
utf-8  '\U0001'+'\u8000'*    160  (+11%)    177
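
For readers checking the tables, the bracketed percentages appear consistent with the relative change from the vanilla figure to the patched figure, rounded to a whole percent. The sketch below reproduces a few rows on that assumption; the function name is hypothetical, the thread does not say how bench-diff.py actually computes the column, and it is assumed here that a larger figure means faster decoding.

```python
def format_change(vanilla: float, patched: float) -> str:
    """Relative change from vanilla to patched, as a signed whole percent."""
    change = (patched / vanilla - 1.0) * 100.0
    # The "+" flag keeps the sign, so a tiny slowdown prints as "-0%".
    return f"({change:+.0f}%)"


# Rows taken from the two tables above (vanilla, patched).
rows = [(382, 562), (383, 355), (343, 342)]
for vanilla, patched in rows:
    print(vanilla, format_change(vanilla, patched), patched)
```

The "(-0%)" entries in the tables fall out naturally from this formatting: a change between -0.5% and 0% keeps its sign after rounding.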

--
components: Interpreter Core, Unicode
files: decode_utf8_signed_byte.patch
keywords: patch
messages: 161652
nosy: Arfrever, ezio.melotti, haypo, janssen, jcea, loewis, mark.dickinson, 
ned.deily, pitrou, python-dev, ronaldoussoren, storchaka
priority: normal
severity: normal
status: open
title: Even faster UTF-8 decoding
type: performance
versions: Python 3.3
Added file: http://bugs.python.org/file25717/decode_utf8_signed_byte.patch




[issue14923] Even faster UTF-8 decoding

2012-05-26 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file25718/decodebench.py




[issue14923] Even faster UTF-8 decoding

2012-05-26 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file25719/bench-diff.py




[issue14923] Even faster UTF-8 decoding

2012-05-26 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

I see a slight increase under 64-bit Linux with gcc 4.5.2, too:

                         vanilla        patched

utf-8  'A'*1                7857  (+4%)    8210
utf-8  'A'*+'\x80'          5392  (+8%)    5843
utf-8  'A'*+'\u0100'        2119  (+3%)    2173
utf-8  'A'*+'\u8000'        2121  (+2%)    2172
utf-8  'A'*+'\U0001'        2248  (+2%)    2293
utf-8  '\x80'*1             1015  (+1%)    1021
utf-8  '\x80'+'A'*          2747  (+5%)    2877
utf-8  '\x80'*+'\u0100'      868  (+0%)     869
utf-8  '\x80'*+'\u8000'      857  (+2%)     870
utf-8  '\x80'*+'\U0001'      877  (+0%)     881
utf-8  '\u0100'*1           1016  (+16%)   1181
utf-8  '\u0100'+'A'*        2506  (+3%)    2592
utf-8  '\u0100'+'\x80'*     1015  (+16%)   1179
utf-8  '\u0100'*+'\u8000'   1015  (+16%)   1182
utf-8  '\u0100'*+'\U0001'    875  (+13%)    992
utf-8  '\u8000'*1            836  (+18%)    985
utf-8  '\u8000'+'A'*        2508  (+3%)    2588
utf-8  '\u8000'+'\x80'*     1015  (+16%)   1182
utf-8  '\u8000'+'\u0100'*   1014  (+17%)   1182
utf-8  '\u8000'*+'\U0001'    767  (+17%)    894
utf-8  '\U0001'*1            730  (+0%)     732
utf-8  '\U0001'+'A'*        2542  (+2%)    2599
utf-8  '\U0001'+'\x80'*     1013  (+17%)   1182
utf-8  '\U0001'+'\u0100'*   1013  (+17%)   1181
utf-8  '\U0001'+'\u8000'*    727  (+0%)     728

--




[issue14923] Even faster UTF-8 decoding

2012-05-26 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

It seems the patch relies on a two's complement representation of integers. 
Mark, do you think that's ok?

--
stage:  -> commit review




[issue14923] Even faster UTF-8 decoding

2012-05-26 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
stage: commit review -> patch review




[issue14923] Even faster UTF-8 decoding

2012-05-26 Thread Serhiy Storchaka

Serhiy Storchaka storch...@gmail.com added the comment:

> It seems the patch relies on a two's complement representation of integers. 
> Mark, do you think that's ok?

Yes, the patch depends on two facts: 8-bit bytes and a two's
complement representation of integers. That's why I call it a trick.
However, CPython does not work on other platforms today anyway. Still, we
can wrap the macro definition in #if/#else/#endif and provide the
traditional form as a fallback (but I don't remember how to test for a
two's complement representation at compile time).
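
The thread does not quote the patch itself, so the following Python sketch is only a guess at the kind of trick being described: reinterpret each byte as a signed 8-bit two's complement value, so that the continuation range 0x80..0xBF becomes -128..-65 and a single comparison suffices. The function names and the threshold of -0x40 are assumptions, not the patch's actual macro.

```python
def as_signed_byte(ch: int) -> int:
    """Reinterpret an unsigned byte (0..255) as 8-bit two's complement."""
    return int.from_bytes(bytes([ch]), "big", signed=True)


def is_continuation_signed(ch: int) -> bool:
    """Hypothetical one-comparison test on the signed value."""
    return as_signed_byte(ch) < -0x40


def is_continuation_mask(ch: int) -> bool:
    """Traditional form (assumed): the top two bits must be 10."""
    return (ch & 0xC0) == 0x80


# Check that both forms classify every byte value identically.
agree = all(is_continuation_signed(ch) == is_continuation_mask(ch)
            for ch in range(256))
print(agree)
print(as_signed_byte(0x80), as_signed_byte(0xBF), as_signed_byte(0xC0))
```

The equivalence holds only because of the two assumptions named above: bytes are 8 bits wide and the signed reading is two's complement.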

--




[issue14923] Even faster UTF-8 decoding

2012-05-26 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

The C standard says, in 6.3.1.3/3:

Otherwise [*], the new type is signed and the value cannot be represented in 
it; either the result is implementation-defined or an implementation-defined 
signal is raised.

[*]: the value cannot be exactly converted, and the target type is not unsigned.

We shouldn't be using unsigned->signed conversions where the source value is 
out of range for the signed type.

--
