[issue22649] Use _PyUnicodeWriter in case_operation()
Changes by STINNER Victor victor.stin...@gmail.com: -- resolution: - rejected status: open - closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22649 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22649] Use _PyUnicodeWriter in case_operation()
Changes by Arfrever Frehtes Taifersar Arahesis arfrever@gmail.com: -- nosy: +Arfrever ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22649 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22649] Use _PyUnicodeWriter in case_operation()
Antoine Pitrou added the comment: Looks like it's cheaper to overallocate than add checks for overflow at each loop iteration. -- nosy: +pitrou ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22649 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22649] Use _PyUnicodeWriter in case_operation()
STINNER Victor added the comment: Looks like it's cheaper to overallocate than add checks for overflow at each loop iteration. I expected that the temporary Py_UCS4 buffer and the conversion to a Unicode object (Py_UCS1, Py_UCS2 or Py_UCS4) would be more expensive than _PyUnicodeWriter. It looks like it's slower. I tried to optimize the code but I didn't see how to make it really faster than the current code. -- Currently, the code uses: for (j = 0; j n_res; j++) { *maxchar = Py_MAX(*maxchar, mapped[j]); res[k++] = mapped[j]; } where res is a Py_UCS4* string, and mapped an array of 3 Py_UCS4. I replaced it with a call to case_operation_write() which calls _PyUnicodeWriter_WriteCharInline(). _PyUnicodeWriter_WriteCharInline() is maybe more expensive than res[k++] = mapped[j];. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22649 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22649] Use _PyUnicodeWriter in case_operation()
New submission from STINNER Victor: The case_operation() in Objects/unicodeobject.c is used for case operations: lower, upper, casefold, etc. Currently, the function uses a buffer of Py_UCS4 and overallocate the buffer by 300%. The function uses the worst case: one character replaced with 3 characters. I propose the use the _PyUnicodeWriter API to be able to optimize the most common case: each character is replaced by only one another character, and the output string uses the same unicode kind (UCS1, UCS2 or UCS4). The patch preallocates the writer using the kind of the input string, but in some cases, the result uses a lower kind (ex: latin1 = ASCII). Special characters taking the slow path from unit tests: - test_capitalize: 'finnish' = 'FInnish' (ascii) - test_casefold: 'ß' = 'ss', 'fi' = 'fi' - test_swapcase: 'fi' = 'FI', 'ß' = 'SS' - test_title: 'fiNNISH' = 'Finnish' - test_upper: 'fi' = 'FI', 'ß' = 'SS' The writer only uses overallocation if a replaced character uses more than one character. Bad cases where the length changes: - test_capitalize: 'ῳῳῼῼ' = 'ΩΙῳῳῳ', 'hİ' = 'Hi̇', 'ῒİ' = 'Ϊ̀i̇', 'finnish' = 'FInnish' - test_casefold: 'ß' = 'ss', 'fi' = 'fi' - test_lower: 'İ' = 'i̇' - test_swapcase: 'fi' = 'FI', 'İ' = 'i̇', 'ß' = 'SS', 'ῒ' = 'Ϊ̀' - test_title: 'fiNNISH' = 'Finnish' - test_upper: 'fi' = 'FI', 'ß' = 'SS', 'ῒ', 'Ϊ̀' -- files: case_writer.patch keywords: patch messages: 229497 nosy: haypo priority: normal severity: normal status: open title: Use _PyUnicodeWriter in case_operation() type: performance versions: Python 3.5 Added file: http://bugs.python.org/file36942/case_writer.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22649 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22649] Use _PyUnicodeWriter in case_operation()
Serhiy Storchaka added the comment: Add tests for 'µ' or 'ÿ' (upper maps UCS1 to UCS2), 'ΐ' or like (upper maps UCS2 to 3 UCS2), 'ffi' or 'ffl' (upper maps UCS2 to 3 ASCII), 'İ' (only one character for which lower doesn't map to 1 character), 'Å' (lower maps UCS2 to UCS1), any of Deseret or Warang Citi characters (UCS4). -- nosy: +serhiy.storchaka ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22649 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22649] Use _PyUnicodeWriter in case_operation()
STINNER Victor added the comment: Benchmark: bench_case.py. Hum, case_writer.patch looks to be always slower: +--+ Summary | orig | writer +--+ lower with 'a' | 5.76 ms (*) | 5.76 ms lower with 'é' | 62.9 ms (*) | 76.8 ms (+22%) lower with '€' | 75.2 ms (*) | 83.6 ms (+11%) lower with 'fi' | 75.3 ms (*) | 83.7 ms (+11%) lower with 'ß' | 66.4 ms (*) |76 ms (+15%) upper with 'a' | 5.66 ms (*) | 5.66 ms upper with 'é' | 48.3 ms (*) | 75.9 ms (+57%) upper with '€' | 50.1 ms (*) | 77.9 ms (+55%) upper with 'fi' | 93.7 ms (*) | 137 ms (+46%) upper with 'ß' | 91.9 ms (*) | 119 ms (+29%) casefold with 'a' | 5.66 ms (*) | 5.67 ms casefold with 'é' | 64.5 ms (*) | 95.8 ms (+48%) casefold with '€' |67 ms (*) | 96.1 ms (+43%) casefold with 'fi' | 97.1 ms (*) | 132 ms (+35%) casefold with 'ß' | 93.7 ms (*) | 122 ms (+30%) swapcase with 'a' | 99.7 ms (*) |107 ms (+7%) swapcase with 'é' | 99.7 ms (*) |107 ms (+7%) swapcase with '€' |78 ms (*) | 87.4 ms (+12%) swapcase with 'fi' | 143 ms (*) |152 ms (+7%) swapcase with 'ß' | 140 ms (*) | 138 ms title with 'a' |82 ms (*) | 98.2 ms (+20%) title with 'é' | 81.9 ms (*) | 98.2 ms (+20%) title with '€' | 90.2 ms (*) | 115 ms (+28%) title with 'fi' | 93.9 ms (*) | 112 ms (+20%) title with 'ß' | 91.3 ms (*) | 103 ms (+13%) capitalize with 'a' | 62.3 ms (*) | 79.2 ms (+27%) capitalize with 'é' | 62.1 ms (*) | 79.1 ms (+27%) capitalize with '€' | 72.9 ms (*) | 76.5 ms capitalize with 'fi' | 72.6 ms (*) | 90.3 ms (+24%) capitalize with 'ß' | 69.5 ms (*) |80 ms (+15%) +--+ Total | 2.24 sec (*) | 2.71 sec (+21%) +--+ See bench.txt for the full output. -- Added file: http://bugs.python.org/file36943/bench_case.py ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22649 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22649] Use _PyUnicodeWriter in case_operation()
Changes by STINNER Victor victor.stin...@gmail.com: Added file: http://bugs.python.org/file36944/bench.txt ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue22649 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com