[issue22649] Use _PyUnicodeWriter in case_operation()

2015-04-06 Thread STINNER Victor


Changes by STINNER Victor :


--
resolution:  -> rejected
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue22649] Use _PyUnicodeWriter in case_operation()

2014-10-16 Thread STINNER Victor


STINNER Victor added the comment:

> Looks like it's cheaper to overallocate than add checks for overflow at each 
> loop iteration.

I expected that the temporary Py_UCS4 buffer and the conversion to a Unicode 
object (Py_UCS1, Py_UCS2 or Py_UCS4) would be more expensive than 
_PyUnicodeWriter. It turns out that _PyUnicodeWriter is the slower of the two.

I tried to optimize the code but I didn't see how to make it really faster than 
the current code.

--

Currently, the code uses:

for (j = 0; j < n_res; j++) {
   *maxchar = Py_MAX(*maxchar, mapped[j]);
   res[k++] = mapped[j];
}

where res is a Py_UCS4* string, and mapped an array of 3 Py_UCS4.
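
As a rough illustration of what that loop does, here is a pure-Python model of 
the overallocating strategy. It is only a sketch, not the CPython code: the 
name case_operation_model is made up, and calling str.upper() on one character 
at a time stands in for the C mapping step that fills "mapped".

```python
def case_operation_model(text):
    """Pure-Python model of the overallocating loop (illustration only)."""
    # Worst case: every input character maps to 3 output characters.
    res = [0] * (3 * len(text))
    k = 0
    maxchar = 0
    for ch in text:
        # Stand-in for the C mapping step: upper-case a single character.
        mapped = [ord(c) for c in ch.upper()]
        for code in mapped:
            maxchar = max(maxchar, code)
            res[k] = code
            k += 1
    # Only the first k slots are used; the rest of the buffer was overallocated.
    return "".join(map(chr, res[:k])), maxchar, len(res)


result, maxchar, allocated = case_operation_model("\xdfa")
print(result, hex(maxchar), allocated)
```

The inner loop has no bounds check because the buffer can never overflow, which 
is the property the writer-based version gives up.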

I replaced it with a call to case_operation_write(), which calls 
_PyUnicodeWriter_WriteCharInline().

_PyUnicodeWriter_WriteCharInline() is probably more expensive than 
"res[k++] = mapped[j];".

--

[issue22649] Use _PyUnicodeWriter in case_operation()

2014-10-16 Thread Antoine Pitrou


Antoine Pitrou added the comment:

Looks like it's cheaper to overallocate than add checks for overflow at each 
loop iteration.

--
nosy: +pitrou

[issue22649] Use _PyUnicodeWriter in case_operation()

2014-10-16 Thread Arfrever Frehtes Taifersar Arahesis


Changes by Arfrever Frehtes Taifersar Arahesis :


--
nosy: +Arfrever

[issue22649] Use _PyUnicodeWriter in case_operation()

2014-10-15 Thread STINNER Victor


Changes by STINNER Victor :


Added file: http://bugs.python.org/file36944/bench.txt

[issue22649] Use _PyUnicodeWriter in case_operation()

2014-10-15 Thread STINNER Victor


STINNER Victor added the comment:

Benchmark: bench_case.py. Hmm, case_writer.patch looks to be slower almost 
everywhere, and never clearly faster:

--------------------+--------------+----------------
Summary             |         orig |          writer
--------------------+--------------+----------------
lower with 'a'      |  5.76 ms (*) |         5.76 ms
lower with 'é'      |  62.9 ms (*) |  76.8 ms (+22%)
lower with '€'      |  75.2 ms (*) |  83.6 ms (+11%)
lower with 'ﬁ'      |  75.3 ms (*) |  83.7 ms (+11%)
lower with 'ß'      |  66.4 ms (*) |    76 ms (+15%)
upper with 'a'      |  5.66 ms (*) |         5.66 ms
upper with 'é'      |  48.3 ms (*) |  75.9 ms (+57%)
upper with '€'      |  50.1 ms (*) |  77.9 ms (+55%)
upper with 'ﬁ'      |  93.7 ms (*) |   137 ms (+46%)
upper with 'ß'      |  91.9 ms (*) |   119 ms (+29%)
casefold with 'a'   |  5.66 ms (*) |         5.67 ms
casefold with 'é'   |  64.5 ms (*) |  95.8 ms (+48%)
casefold with '€'   |    67 ms (*) |  96.1 ms (+43%)
casefold with 'ﬁ'   |  97.1 ms (*) |   132 ms (+35%)
casefold with 'ß'   |  93.7 ms (*) |   122 ms (+30%)
swapcase with 'a'   |  99.7 ms (*) |    107 ms (+7%)
swapcase with 'é'   |  99.7 ms (*) |    107 ms (+7%)
swapcase with '€'   |    78 ms (*) |  87.4 ms (+12%)
swapcase with 'ﬁ'   |   143 ms (*) |    152 ms (+7%)
swapcase with 'ß'   |   140 ms (*) |          138 ms
title with 'a'      |    82 ms (*) |  98.2 ms (+20%)
title with 'é'      |  81.9 ms (*) |  98.2 ms (+20%)
title with '€'      |  90.2 ms (*) |   115 ms (+28%)
title with 'ﬁ'      |  93.9 ms (*) |   112 ms (+20%)
title with 'ß'      |  91.3 ms (*) |   103 ms (+13%)
capitalize with 'a' |  62.3 ms (*) |  79.2 ms (+27%)
capitalize with 'é' |  62.1 ms (*) |  79.1 ms (+27%)
capitalize with '€' |  72.9 ms (*) |         76.5 ms
capitalize with 'ﬁ' |  72.6 ms (*) |  90.3 ms (+24%)
capitalize with 'ß' |  69.5 ms (*) |    80 ms (+15%)
--------------------+--------------+----------------
Total               | 2.24 sec (*) | 2.71 sec (+21%)
--------------------+--------------+----------------

See bench.txt for the full output.
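
The attached bench_case.py is not reproduced in this thread. For readers who 
want to try something similar, a minimal sketch using the standard timeit 
module could look like the following. The function name bench, the string 
length and the repeat counts are all arbitrary assumptions, so the absolute 
numbers will not be comparable with the table above.

```python
import timeit


def bench(method, char, length=1000, number=200):
    """Time str.<method>() on a string made of one repeated character."""
    text = char * length
    func = getattr(text, method)
    # Best of 3 runs; each run is the total time in seconds for `number` calls.
    return min(timeit.repeat(func, number=number, repeat=3))


if __name__ == "__main__":
    methods = ("lower", "upper", "casefold", "swapcase", "title", "capitalize")
    # 'a', 'é', '€', 'ﬁ', 'ß' written as escapes to avoid encoding issues.
    chars = ("a", "\xe9", "\u20ac", "\ufb01", "\xdf")
    for method in methods:
        for char in chars:
            ms = bench(method, char) * 1e3
            print(f"{method} with {char!r}: {ms:.3f} ms")
```

Running it once on an unpatched build and once on a patched build gives the two 
columns to compare.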

--
Added file: http://bugs.python.org/file36943/bench_case.py

[issue22649] Use _PyUnicodeWriter in case_operation()

2014-10-15 Thread Serhiy Storchaka


Serhiy Storchaka added the comment:

Add tests for 'µ' or 'ÿ' (upper maps UCS1 to UCS2), 'ΐ' or similar (upper maps 
one UCS2 character to 3 UCS2 characters), 'ﬃ' or 'ﬄ' (upper maps one UCS2 
character to 3 ASCII characters), 'İ' (the only character whose lowercase 
mapping is not a single character), 'Å' (lower maps UCS2 to UCS1), and any of 
the Deseret or Warang Citi characters (UCS4).
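
The mappings suggested above can be checked from Python itself. This is a 
sketch under a few assumptions: kind() is a hypothetical helper that estimates 
the storage width from the largest code point, the characters are written as 
escapes, and 'Å' is assumed to mean U+212B ANGSTROM SIGN, since that is the 
one that lives in the UCS2 range.

```python
def kind(s):
    """Bytes per character a compact str would need: 1, 2 or 4."""
    m = max(map(ord, s))
    return 1 if m < 0x100 else 2 if m < 0x10000 else 4


# (input, method, expected result)
cases = [
    ("\xb5", "upper", "\u039c"),              # MICRO SIGN: UCS1 -> UCS2
    ("\xff", "upper", "\u0178"),              # y with diaeresis: UCS1 -> UCS2
    ("\u0390", "upper", "\u0399\u0308\u0301"),  # 1 UCS2 char -> 3 UCS2 chars
    ("\ufb03", "upper", "FFI"),               # ffi ligature -> 3 ASCII chars
    ("\ufb04", "upper", "FFL"),               # ffl ligature -> 3 ASCII chars
    ("\u0130", "lower", "i\u0307"),           # lower gives 2 characters
    ("\u212b", "lower", "\xe5"),              # assumed ANGSTROM SIGN: UCS2 -> UCS1
    ("\U00010400", "lower", "\U00010428"),    # Deseret: UCS4 -> UCS4
]

for text, method, expected in cases:
    result = getattr(text, method)()
    assert result == expected
    print(f"U+{ord(text):04X}.{method}(): kind {kind(text)} -> {kind(result)}, "
          f"length {len(text)} -> {len(result)}")
```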

--
nosy: +serhiy.storchaka

[issue22649] Use _PyUnicodeWriter in case_operation()

2014-10-15 Thread STINNER Victor


New submission from STINNER Victor:

The case_operation() function in Objects/unicodeobject.c implements the case 
operations: lower, upper, casefold, etc.

Currently, the function uses a buffer of Py_UCS4 and overallocates it to 300% 
of the input length. This covers the worst case: one character replaced with 3 
characters.

I propose to use the _PyUnicodeWriter API to optimize the most common case: 
each character is replaced by exactly one other character, and the output 
string uses the same Unicode kind (UCS1, UCS2 or UCS4).

The patch preallocates the writer using the kind of the input string, but in 
some cases the result uses a lower kind (e.g. latin1 => ASCII). "Special" 
characters from the unit tests that take the slow path:

- test_capitalize: 'ﬁnnish' => 'FInnish' (ascii)
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_swapcase: 'ﬁ' => 'FI', 'ß' => 'SS'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS'

The writer only uses overallocation if a character is replaced with more than 
one character. Bad cases where the length changes:

- test_capitalize: 'ῳῳῼῼ' => 'ΩΙῳῳῳ', 'hİ' => 'Hi̇', 'ῒİ' => 'Ϊ̀i̇', 'ﬁnnish' 
=> 'FInnish'
- test_casefold: 'ß' => 'ss', 'ﬁ' => 'fi'
- test_lower: 'İ' => 'i̇'
- test_swapcase: 'ﬁ' => 'FI', 'İ' => 'i̇', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
- test_title: 'ﬁNNISH' => 'Finnish'
- test_upper: 'ﬁ' => 'FI', 'ß' => 'SS', 'ῒ' => 'Ϊ̀'
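
To see which inputs fall into the "length changes" category listed above, a 
small Python check is enough. The helper name needs_growth is hypothetical, 
and applying the method to one character at a time is a simplification that 
ignores context-dependent rules such as final sigma in lower().

```python
def needs_growth(text, method):
    """Return True if any single character maps to more than one character."""
    return any(len(getattr(ch, method)()) > 1 for ch in text)


samples = [
    ("hello", "upper"),       # common case: one-to-one mapping
    ("stra\xdfe", "upper"),   # 'ß' => 'SS'
    ("\ufb01", "casefold"),   # 'ﬁ' => 'fi'
    ("\u0130", "lower"),      # 'İ' => 'i' + combining dot above
    ("\u1fd2", "upper"),      # 'ῒ' => 3 characters
]

for text, method in samples:
    out = getattr(text, method)()
    print(f"{text!r}.{method}(): {len(text)} -> {len(out)} chars, "
          f"growth needed: {needs_growth(text, method)}")
```

Only the inputs for which this returns True would make the writer switch on 
overallocation; everything else stays on the preallocated fast path.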

--
files: case_writer.patch
keywords: patch
messages: 229497
nosy: haypo
priority: normal
severity: normal
status: open
title: Use _PyUnicodeWriter in case_operation()
type: performance
versions: Python 3.5
Added file: http://bugs.python.org/file36942/case_writer.patch
