[issue14923] Even faster UTF-8 decoding
Serhiy Storchaka storch...@gmail.com added the comment: Any chance to commit the patch before final feature freeze? -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue14923 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue14923] Even faster UTF-8 decoding
Antoine Pitrou pit...@free.fr added the comment:

> Any chance to commit the patch before final feature freeze?

I'll defer to Mark :-)
[issue14923] Even faster UTF-8 decoding
Mark Dickinson dicki...@gmail.com added the comment: Okay, will look at this this afternoon.
--
assignee: -> mark.dickinson
[issue14923] Even faster UTF-8 decoding
Mark Dickinson dicki...@gmail.com added the comment: I'm happy to apply the 'decode_utf8_range_check.patch'; I'll do that unless there are objections. The code is clearer than the original, and if we get a speedup into the bargain then I don't see a reason not to apply this. I'm less comfortable with either the original patch, or the most recent one (decode_utf8_signed_byte-2.patch).
[issue14923] Even faster UTF-8 decoding
Ezio Melotti ezio.melo...@gmail.com added the comment: Serhiy, does this patch also fix #8271? If so, can you also include the tests I wrote for it?
[issue14923] Even faster UTF-8 decoding
Mark Dickinson dicki...@gmail.com added the comment: Patch applied. Closing. Ezio: the patch is pure optimization, with no change in semantics; I don't see how it could fix #8271.
--
resolution: -> fixed
status: open -> closed
[issue14923] Even faster UTF-8 decoding
Roundup Robot devn...@psf.upfronthosting.co.za added the comment: New changeset 3214c9ebcf5e by Mark Dickinson in branch 'default': Issue #14923: Optimize continuation-byte check in UTF-8 decoding. Patch by Serhiy Storchaka. http://hg.python.org/cpython/rev/3214c9ebcf5e
[issue14923] Even faster UTF-8 decoding
Serhiy Storchaka storch...@gmail.com added the comment:

> Serhiy, does this patch also fix #8271?

No, this patch does not change behavior. But the updated patch for issue 8271 now contains this patch (I hope this will help it get merged).

> If so, can you also include the tests I wrote for it?

Your tests are included in the patch for issue 8271.
[issue14923] Even faster UTF-8 decoding
Serhiy Storchaka storch...@gmail.com added the comment: Here is a patch that uses some sort of autodetection. -- Added file: http://bugs.python.org/file26098/decode_utf8_signed_byte-2.patch
[issue14923] Even faster UTF-8 decoding
Mark Dickinson dicki...@gmail.com added the comment:

> It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok?

(1) Relying on two's complement integers seems fine to me: we're already relying on it in other places in Python (e.g., bitwise operations for ints in Python 2.x); it seems unlikely Python's going to run into current or future hardware that uses anything else; and any special-case code for ones' complement or sign-magnitude is going to be essentially unused and awkward to test, so it's probably better not to have it in the codebase at all. In an ideal world, I guess we'd add some configure-time tests for two's complement so that in the unlikely event that Python *does* meet a non two's complement platform the build fails early with a clear message rather than the tests failing in strange ways.

(2) The bit that Martin identifies: relying on conversion from unsigned to signed doing a wraparound modulo 2**n (for a suitable n) is a bit more troubling; it's something that I try to avoid where possible, but that's not always easy. I don't recall ever having encountered this causing problems in practice, though. It feels like a leftover from a non-two's complement world where processors would have a hard time giving wraparound semantics, so the C standard didn't want to require it. Again, if we're going to rely on this, it would probably make sense to have some configure-time checks; and it would be better not to rely on it at all without a really good reason.
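To make the representation concern concrete, here is a minimal sketch that models it in Python rather than C. It assumes the signed-byte trick amounts to reinterpreting a byte as a signed 8-bit value and testing `value < -0x40`; the thread does not show the actual macro, so that form and all function names below are hypothetical. The sketch reinterprets each bit pattern under the three signed representations and lists the bytes that such a test would misclassify.

```python
def as_twos_complement(byte: int) -> int:
    """Value of an 8-bit pattern read as a two's complement signed byte."""
    return byte - 0x100 if byte & 0x80 else byte


def as_ones_complement(byte: int) -> int:
    """Value of an 8-bit pattern read as a ones' complement signed byte."""
    return -((~byte) & 0xFF) if byte & 0x80 else byte


def as_sign_magnitude(byte: int) -> int:
    """Value of an 8-bit pattern read as a sign-magnitude signed byte."""
    return -(byte & 0x7F) if byte & 0x80 else byte


def misclassified(interpret) -> list:
    """Bytes for which the assumed signed test disagrees with the range 0x80..0xBF."""
    return [b for b in range(256)
            if (interpret(b) < -0x40) != (0x80 <= b < 0xC0)]


twos = misclassified(as_twos_complement)
ones = misclassified(as_ones_complement)
sign_mag = misclassified(as_sign_magnitude)
print(len(twos), len(ones), len(sign_mag))
```

In this model the signed comparison is exact only for two's complement, which is the dependency the discussion is about. It does not model the separate question of what a C implementation does when converting an out-of-range unsigned value to a signed type.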
[issue14923] Even faster UTF-8 decoding
Serhiy Storchaka storch...@gmail.com added the comment: Yes, this is implementation-dependent behavior (and on the supported platforms it is implemented in one well-defined way). However, if the continuation-byte check is done in the simplest way, ((ch) >= 0x80 && (ch) < 0xC0), it has the same effect (a speedup of up to +45%) on AMD Athlon.

                           vanilla        patched
utf-8  'A'*1                  2061  (-2%)    2018
utf-8  '\x80'*1                383  (+9%)     416
utf-8  '\x80'+'A'*            1273  (+3%)    1315
utf-8  '\u0100'*1              382  (+46%)    558
utf-8  '\u0100'+'A'*          1239  (+0%)    1245
utf-8  '\u0100'+'\x80'*        383  (+46%)    558
utf-8  '\u8000'*1              434  (-6%)     408
utf-8  '\u8000'+'A'*          1245  (+0%)    1245
utf-8  '\u8000'+'\x80'*        382  (+46%)    556
utf-8  '\u8000'+'\u0100'*      383  (+45%)    556
utf-8  '\U0001'*1              358  (+0%)     359
utf-8  '\U0001'+'A'*          1171  (-0%)    1170
utf-8  '\U0001'+'\x80'*        381  (+30%)    495
utf-8  '\U0001'+'\u0100'*      381  (+30%)    495
utf-8  '\U0001'+'\u8000'*      404  (-5%)     385

On Intel Atom the results either did not change or became a little better.

                           vanilla        patched
utf-8  'A'*1                   623  (+3%)     642
utf-8  '\x80'*1                145  (+9%)     158
utf-8  '\x80'+'A'*             354  (+4%)     367
utf-8  '\u0100'*1              164  (+0%)     164
utf-8  '\u0100'+'A'*           343  (+2%)     351
utf-8  '\u0100'+'\x80'*        164  (+1%)     165
utf-8  '\u8000'*1              175  (-2%)     171
utf-8  '\u8000'+'A'*           349  (+3%)     359
utf-8  '\u8000'+'\x80'*        164  (+0%)     164
utf-8  '\u8000'+'\u0100'*      164  (+0%)     164
utf-8  '\U0001'*1              152  (-1%)     150
utf-8  '\U0001'+'A'*           313  (+2%)     319
utf-8  '\U0001'+'\x80'*        161  (+1%)     162
utf-8  '\U0001'+'\u0100'*      161  (+1%)     162
utf-8  '\U0001'+'\u8000'*      160  (-2%)     156

--
Added file: http://bugs.python.org/file25733/decode_utf8_range_check.patch
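Since the range-check patch is meant as a pure optimization, the range test should accept exactly the same bytes as the check it replaces. The sketch below, written in Python for brevity, compares the range test against a bit-mask test over all 256 byte values. The mask form is an assumption about what the pre-patch check looked like (the thread does not quote it), and the function names are hypothetical.

```python
def is_continuation_mask(ch: int) -> bool:
    """Assumed pre-patch form: the top two bits of the byte are 10."""
    return (ch & 0xC0) == 0x80


def is_continuation_range(ch: int) -> bool:
    """Range form, mirroring ((ch) >= 0x80 && (ch) < 0xC0)."""
    return 0x80 <= ch < 0xC0


# Compare the two predicates on every possible byte value.
mismatches = [ch for ch in range(256)
              if is_continuation_mask(ch) != is_continuation_range(ch)]
continuation_count = sum(is_continuation_range(ch) for ch in range(256))
print(mismatches, continuation_count)
```

Both predicates select the 64 bytes 0x80 through 0xBF, so any difference between them can only be in speed, not in which inputs decode successfully.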
[issue14923] Even faster UTF-8 decoding
Antoine Pitrou pit...@free.fr added the comment:

> However, if the continuation byte check to do the simplest way ((ch) >= 0x80 && (ch) < 0xC0), this has the same effect (speed up to +45%) on AMD Athlon.

It doesn't produce any significant speedup on an Intel Core i5-2500.
[issue14923] Even faster UTF-8 decoding
New submission from Serhiy Storchaka storch...@gmail.com: Strange as it may seem, a simple trick makes UTF-8 decoding even faster. Here are the benchmark results.

On 32-bit Linux, AMD Athlon 64 X2:

                           vanilla        patched
utf-8  'A'*1                  2061  (+3%)    2115
utf-8  '\x80'*1                383  (-7%)     355
utf-8  '\x80'+'A'*            1273  (+1%)    1290
utf-8  '\u0100'*1              382  (+47%)    562
utf-8  '\u0100'+'A'*          1239  (+1%)    1253
utf-8  '\u0100'+'\x80'*        383  (+47%)    562
utf-8  '\u8000'*1              434  (-6%)     409
utf-8  '\u8000'+'A'*          1245  (+1%)    1256
utf-8  '\u8000'+'\x80'*        382  (+47%)    560
utf-8  '\u8000'+'\u0100'*      383  (+44%)    553
utf-8  '\U0001'*1              358  (+4%)     373
utf-8  '\U0001'+'A'*          1171  (+0%)    1176
utf-8  '\U0001'+'\x80'*        381  (+44%)    548
utf-8  '\U0001'+'\u0100'*      381  (+44%)    548
utf-8  '\U0001'+'\u8000'*      404  (+0%)     406

On 32-bit Linux, Intel Atom N570:

                           vanilla        patched
utf-8  'A'*1                   623  (+0%)     626
utf-8  '\x80'*1                145  (+15%)    167
utf-8  '\x80'+'A'*             354  (+2%)     362
utf-8  '\u0100'*1              164  (+10%)    181
utf-8  '\u0100'+'A'*           343  (-0%)     342
utf-8  '\u0100'+'\x80'*        164  (+11%)    182
utf-8  '\u8000'*1              175  (+5%)     183
utf-8  '\u8000'+'A'*           349  (+0%)     349
utf-8  '\u8000'+'\x80'*        164  (+11%)    182
utf-8  '\u8000'+'\u0100'*      164  (+10%)    181
utf-8  '\U0001'*1              152  (+11%)    168
utf-8  '\U0001'+'A'*           313  (+0%)     313
utf-8  '\U0001'+'\x80'*        161  (+11%)    179
utf-8  '\U0001'+'\u0100'*      161  (+11%)    179
utf-8  '\U0001'+'\u8000'*      160  (+11%)    177

--
components: Interpreter Core, Unicode
files: decode_utf8_signed_byte.patch
keywords: patch
messages: 161652
nosy: Arfrever, ezio.melotti, haypo, janssen, jcea, loewis, mark.dickinson, ned.deily, pitrou, python-dev, ronaldoussoren, storchaka
priority: normal
severity: normal
status: open
title: Even faster UTF-8 decoding
type: performance
versions: Python 3.3
Added file: http://bugs.python.org/file25717/decode_utf8_signed_byte.patch
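The figures in this report come from the attached decodebench.py, which the thread does not reproduce. As a rough stand-in, the sketch below times `bytes.decode('utf-8')` with the standard `timeit` module on a few inputs of different character widths. The input sizes, sample names, repeat counts, and the MB/s unit are all assumptions chosen for illustration, not taken from the original script or its tables.

```python
import timeit


def decode_rate(text: str, repeat: int = 3, number: int = 200) -> float:
    """Best observed UTF-8 decoding rate for `text`, in MB of input per second."""
    data = text.encode('utf-8')
    # Take the fastest of several runs to reduce noise from other processes.
    best = min(timeit.repeat(lambda: data.decode('utf-8'),
                             repeat=repeat, number=number))
    return len(data) * number / best / 1e6


# Hypothetical inputs: one sample per encoded width, plus a mixed case.
SAMPLES = {
    "ascii": 'A' * 10000,
    "2-byte": '\u0100' * 10000,
    "3-byte": '\u8000' * 10000,
    "4-byte": '\U00010000' * 10000,
    "mixed": '\u0100' + 'A' * 9999,
}

rates = {name: decode_rate(text) for name, text in SAMPLES.items()}
for name, rate in rates.items():
    print(f"{name:8} {rate:10.0f} MB/s")
```

Running the same script against an unpatched and a patched interpreter build and comparing the two outputs gives the kind of vanilla/patched comparison shown above; absolute numbers will differ by machine and compiler.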
[issue14923] Even faster UTF-8 decoding
Changes by Serhiy Storchaka storch...@gmail.com: Added file: http://bugs.python.org/file25718/decodebench.py
[issue14923] Even faster UTF-8 decoding
Changes by Serhiy Storchaka storch...@gmail.com: Added file: http://bugs.python.org/file25719/bench-diff.py
[issue14923] Even faster UTF-8 decoding
Antoine Pitrou pit...@free.fr added the comment: I see a slight increase under 64-bit Linux with gcc 4.5.2, too:

                           vanilla        patched
utf-8  'A'*1                  7857  (+4%)    8210
utf-8  'A'*+'\x80'            5392  (+8%)    5843
utf-8  'A'*+'\u0100'          2119  (+3%)    2173
utf-8  'A'*+'\u8000'          2121  (+2%)    2172
utf-8  'A'*+'\U0001'          2248  (+2%)    2293
utf-8  '\x80'*1               1015  (+1%)    1021
utf-8  '\x80'+'A'*            2747  (+5%)    2877
utf-8  '\x80'*+'\u0100'        868  (+0%)     869
utf-8  '\x80'*+'\u8000'        857  (+2%)     870
utf-8  '\x80'*+'\U0001'        877  (+0%)     881
utf-8  '\u0100'*1             1016  (+16%)   1181
utf-8  '\u0100'+'A'*          2506  (+3%)    2592
utf-8  '\u0100'+'\x80'*       1015  (+16%)   1179
utf-8  '\u0100'*+'\u8000'     1015  (+16%)   1182
utf-8  '\u0100'*+'\U0001'      875  (+13%)    992
utf-8  '\u8000'*1              836  (+18%)    985
utf-8  '\u8000'+'A'*          2508  (+3%)    2588
utf-8  '\u8000'+'\x80'*       1015  (+16%)   1182
utf-8  '\u8000'+'\u0100'*     1014  (+17%)   1182
utf-8  '\u8000'*+'\U0001'      767  (+17%)    894
utf-8  '\U0001'*1              730  (+0%)     732
utf-8  '\U0001'+'A'*          2542  (+2%)    2599
utf-8  '\U0001'+'\x80'*       1013  (+17%)   1182
utf-8  '\U0001'+'\u0100'*     1013  (+17%)   1181
utf-8  '\U0001'+'\u8000'*      727  (+0%)     728
[issue14923] Even faster UTF-8 decoding
Antoine Pitrou pit...@free.fr added the comment: It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok?
--
stage: -> commit review
[issue14923] Even faster UTF-8 decoding
Changes by Antoine Pitrou pit...@free.fr:
--
stage: commit review -> patch review
[issue14923] Even faster UTF-8 decoding
Serhiy Storchaka storch...@gmail.com added the comment:

> It seems the patch relies on a two's complement representation of integers. Mark, do you think that's ok?

Yes, the patch depends on two facts: 8-bit bytes and a two's complement representation of integers. That's why I call it a trick. However, CPython today does not work on platforms without these properties anyway. Still, we could wrap the macro definition in #if/#else/#endif and provide the traditional form as a fallback (but I don't remember how to test for a two's complement representation at compile time).
[issue14923] Even faster UTF-8 decoding
Martin v. Löwis mar...@v.loewis.de added the comment: The C standard says, in 6.3.1.3/3:

> Otherwise [*], the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.

[*]: the value cannot be exactly converted, and the target type is not unsigned.

We shouldn't be using unsigned->signed conversions where the source value is out of range for the signed type.