[issue24821] The optimization of string search can cause pessimization

2017-03-30 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-30 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:


New changeset 0a58f72762353768c7d26412e627ff196aac6c4e by Serhiy Storchaka in 
branch 'master':
bpo-24821: Fixed the slowing down to 25 times in the searching of some (#505)
https://github.com/python/cpython/commit/0a58f72762353768c7d26412e627ff196aac6c4e


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-29 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Thank you for testing Louie. Thank you for your review and testing Xiang.

> Would it completely kill performances to remove the optimization?

Yes. The optimization is not used if the lowest byte is 0. You can try to 
search "\0" for getting the result without the optimization. On my netbook the 
optimization speeds up searching by 5.5 times.

Search with optimization:
$ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Ж")'
Median +- std dev: 296 us +- 33 us

Search without optimization:
$ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("\0")'
Median +- std dev: 1.65 ms +- 0.10 ms

Search an unlucky character (unpatched):
$ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
Median +- std dev: 14.7 ms +- 1.8 ms

Search an unlucky character (patched):
$ ./python -m perf timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
Median +- std dev: 2.17 ms +- 0.24 ms

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-29 Thread STINNER Victor

STINNER Victor added the comment:

Would it completely kill performances to remove the optimization?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-29 Thread Ma Lin

Changes by Ma Lin :


--
nosy: +Ma Lin

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-29 Thread Xiang Zhang

Xiang Zhang added the comment:

I can't give a "realistic" example. A more meaningful example may be:

'。', '\u3002', the Chinese period which used in almost every paragraph,
'地', '\u5730', which is a common used word.

./python3 -m perf timeit  -s 's = "你好,我叫李雷。"*1000' 's.find("地")'
Mean +- std dev: 6.65 us +- 0.27 us
Mean +- std dev: 4.08 us +- 0.22 us

It works! :-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-29 Thread Louie Lu

Louie Lu added the comment:

I can now only test on Python3.6, providing much meaningful sentence,
still trying to use perf on cpython master branch.

---

$ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 
's.find("乎")'

Median +- std dev: 228 ns +- 7 ns


$ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 
's.find("乏")'  

Median +- std dev: 143 ns +- 3 ns

---

$ python -m perf timeit -s 's="一件乒乓事事亏, 不乏串連产業, 万丈一争今为举, 其哀哉"*1000' -- 
's.find("乎")'  
   ^^ 
(missing "乎")
Median +- std dev: 100 us +- 3 us

$ python -m perf timeit -s 's="一件乒乓事事亏, 不串連产業, 万丈一争今为举, 其乎哀哉"*1000' -- 
's.find("乏")'
 ^^ (missing "乏")
Median +- std dev: 1.67 us +- 0.05 us

--
nosy: +louielu

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-29 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Ukrainian "Є" is not the only character suffered from this issue. I suppose 
much CJK characters are suffered too. The problem is occurred when the lower 
byte of searched character matches the upper byte of most  characters in the 
text. For example, searching "乎" (U+4e4e) in the sequence of characters U+4eXX 
(but not containing U+4e4e and U+4e4f):

$ ./python -m perf timeit -s 's = 
"一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争
 
亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜
 仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿"*100' -- 's.find("乎")'

Unpatched:  Median +- std dev: 761 us +- 108 us
Patched:Median +- std dev: 117 us +- 9 us

For comparison, searching "乏" (U+4e4f) in the same sequence:

$ ./python -m perf timeit -s 's = 
"一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争
 
亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜
 仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿"*100' -- 's.find("乏")'

Unpatched:  Median +- std dev: 12.8 us +- 1.4 us
Patched:Median +- std dev: 12.6 us +- 1.7 us

Sorry, I don't know Chinese or Japanese and can't provide more realistic 
examples.

--
nosy: +inada.naoki, xiang.zhang

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-06 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
pull_requests: +413

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2017-03-06 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
versions: +Python 3.7 -Python 3.6

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2016-09-12 Thread Ned Deily

Ned Deily added the comment:

If it has waited this long and is truly low priority, I think it can wait a 
little longer until 3.7.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2016-09-12 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
nosy: +ned.deily

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2016-09-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I would commit the patch before beta 1, but:

1) The issue can be considered not as a new feature, but as a fix of 
performance bug (just we don't will to backport the fix to 3.5). I think the 
patch can be committed at the beta stage.

2) I would want to tune magic constants. I experimented on my two computers, 
and thing they are good enough, but would be nice to test them on other 
computers. I need to write automatic benchmark for running on foreign 
computers. This will take a time.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2016-09-06 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Could you please make a review Victor?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2016-06-21 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Ping.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2015-11-14 Thread STINNER Victor

STINNER Victor added the comment:

> The patch also makes a little refactoring. STRINGLIB(fastsearch_memchr_1char) 
> now is renamed and split on two functions STRINGLIB(find_char) and 
> STRINGLIB(rfind_char) with simpler interface.

Could you please push this part of the change? It looks good to me.

--

The C library provides a wmemchr() function which can be used to search for a 
wchar_t character inside a wchar_t* string.

* for 16-bit wchar_t (ex: Windows, AIX), it can be used for Python UCS2 strings
* for 32-bit wchar_t (ex: Linux, Mac OS X), it can be used for Python UCS4 
strings

I don't know if the type is signed or not. If the type is signed, we have to 
ensure that wmemchr() works as expected.

--

I didn't review carefully your new heuristic to search for a character.

Did you try to not use memchr() but a simple C loop for UCS-2 and UCS-4?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2015-11-14 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 1412be96faf0 by Serhiy Storchaka in branch 'default':
Issue #24821: Refactor STRINGLIB(fastsearch_memchr_1char) and split it on
https://hg.python.org/cpython/rev/1412be96faf0

--
nosy: +python-dev

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2015-11-14 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> Could you please push this part of the change? It looks good to me.

Done. The patch that make an optimization now looks much simpler.

> The C library provides a wmemchr() function which can be used to search for a 
> wchar_t character inside a wchar_t* string.

Yes, I know. But this is C99 function, hence we should add configure test for 
it, and I don't know if it exists on Windows. I did not test its performance.

> Did you try to not use memchr() but a simple C loop for UCS-2 and UCS-4?

A simple C loop is 25-50% slower. My patch tries to use memchr(), but if it 
founds false positive, runs a simple C loop for limited length. This makes a 
performance only insignificantly worse in the case of few false positives, but 
makes it much better in cases of often or grouped together false positives. The 
common case when there are no false positives is not affected.

MEMCHR_CUT_OFF is heuristic parameter. It is equal to the number of characters 
for which memchr()/memrchr() and a simple loop work the same time. On my 
machine it is 10-15 for UCS1 and 20-50 for UCS2 and UCS4 (depending on 
direction and character width).

--
Added file: http://bugs.python.org/file41040/find_char_false_positives.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2015-11-13 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Proposed patch makes the degenerate case less hard while preserves the 
optimization for common case.

$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("є")'
1000 loops, best of 3: 330 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("є")'
1000 loops, best of 3: 325 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
100 loops, best of 3: 7.81 msec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("Є")'
100 loops, best of 3: 8.5 msec per loop

$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("є")'
1000 loops, best of 3: 317 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("є")'
1000 loops, best of 3: 327 usec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.find("Є")'
1000 loops, best of 3: 1.1 msec per loop
$ ./python -m timeit -s 's = "АБВГД"*10**5' -- 's.rfind("Є")'
1000 loops, best of 3: 964 usec per loop

The slowdown is decreased from 25 times to 3 times.

The idea is that after memchr found false positive, make a tens iterations of 
simple loop before calling memchr again. This splits the cost of the memchr 
call with a tens of characters.

The patch also makes a little refactoring. STRINGLIB(fastsearch_memchr_1char) 
now is renamed and split on two functions STRINGLIB(find_char) and 
STRINGLIB(rfind_char) with simpler interface. All preconditional checks are 
moved into these functions. These functions now are directly used in other 
files.

--
keywords: +patch
resolution: remind -> 
stage: needs patch -> patch review
Added file: http://bugs.python.org/file41035/find_char.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2015-10-06 Thread Adam

Changes by Adam :


--
nosy: +azsorkin

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2015-09-10 Thread STINNER Victor

STINNER Victor added the comment:

> I think we should use more robust optimization.

What do you propose? I don't understand the purpose of the issue. Do you want 
to remove the optimization?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2015-09-10 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I was going to provide another optimization (I had an idea), but it is not so 
easy as looked to me at first glance. This issue exists rather as a reminder to 
me. I should make a second attempt.

--
resolution:  -> remind

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24821] The optimization of string search can cause pessimization

2015-08-07 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

Search in strings is highly optimized for common case. However for some input 
data the search in non-ascii string becomes unexpectedly slow. Compare:

$ ./python -m timeit -s 's = АБВГД*10**4' -- 'є in s'
10 loops, best of 3: 11.7 usec per loop
$ ./python -m timeit -s 's = АБВГД*10**4' -- 'Є in s'
1000 loops, best of 3: 769 usec per loop

It's because the lowest byte of the code of Ukrainian capital letter Є (U+0404) 
matches the highest byte of codes of most Cyrillic letters (U+04xx). There are 
similar issues with some other scripts.

I think we should use more robust optimization.

--
assignee: serhiy.storchaka
components: Interpreter Core
messages: 248179
nosy: haypo, pitrou, serhiy.storchaka
priority: low
severity: normal
stage: needs patch
status: open
title: The optimization of string search can cause pessimization
type: performance
versions: Python 3.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24821
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com