[issue31484] Cache single-character strings outside of the Latin1 range

2020-10-24 Thread Serhiy Storchaka
Change by Serhiy Storchaka: resolution: -> rejected; stage: patch review -> resolved; status: open -> closed

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-18 Thread Antoine Pitrou
Antoine Pitrou added the comment: Judging by the numbers, this optimization does not sound worth the hassle. It is quite rare to iterate over all characters in a long text while doing so little work with them that the cost of iteration is significant. By the way: > Sorry for using the word

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-17 Thread Xiang Zhang
Xiang Zhang added the comment: I ran the patch against a toy NLP application, cutting words from Shui Hu Zhuan provided by Serhiy. The result is not bad: 6% faster. I also counted the hit rate: 90% hit cell 0, 4.5% hit cell 1, 5.5% miss. I also increased the cache size to 1024 * 2. Although

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: > > The cache of size 2 x 256 slots can increase memory consumption by 50 KiB > > in worst case, 2 x 1024 -- by 200 KiB. > How much is this compared to the total usage? For Python interpreter VmSize: 31784 kB, VmRSS: 7900 kB. The cache doesn't affect

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-16 Thread Ezio Melotti
Ezio Melotti added the comment: > The cache of size 2 x 256 slots can increase memory consumption by 50 KiB in > worst case, 2 x 1024 -- by 200 KiB. How much is this compared to the total usage? > But I don't know how common `for c in s` or `s[i]` is used for Japanese text. I think the same

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-16 Thread Terry J. Reedy
Terry J. Reedy added the comment: If I understand correctly, anyone could change the cache size for their personal or corporate binary by changing `#define BMP_CACHE_SIZE 256`. There should be a comment that it must not be 0 and should be a power of 2 of at least, say, 256.
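One plausible reason for the power-of-2 requirement is that a slot index can then be computed with a bit mask instead of a modulo. The patch itself is not shown in this thread, so the following is only a minimal sketch under that assumption: `slot_index` and `is_power_of_two` are hypothetical helper names, and only `BMP_CACHE_SIZE` comes from the comment above.

```python
# Hypothetical sketch, not the code from the patch.
# BMP_CACHE_SIZE mirrors the #define mentioned in the thread.
BMP_CACHE_SIZE = 256


def is_power_of_two(n: int) -> bool:
    """Return True for 1, 2, 4, 8, ...; False for 0 and everything else."""
    return n > 0 and (n & (n - 1)) == 0


def slot_index(code_point: int, size: int = BMP_CACHE_SIZE) -> int:
    """Map a code point to a cache slot by masking its low bits.

    The mask trick equals `code_point % size` only when size is a
    nonzero power of two, so other sizes are rejected.
    """
    if not is_power_of_two(size):
        raise ValueError("cache size must be a nonzero power of two")
    return code_point & (size - 1)


if __name__ == "__main__":
    for ch in "α中":
        print(f"U+{ord(ch):04X} -> slot {slot_index(ord(ch))}")
```

With a size of 0 the mask would be -1 and every code point would map to itself, and with a size such as 300 some slots would never be used, which is why both are rejected in this sketch.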

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-16 Thread INADA Naoki
INADA Naoki added the comment: Interesting optimization. But I don't know how commonly `for c in s` or `s[i]` is used for Japanese text.

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-16 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Initially I used 2 x 128 slots. It is enough for single-block alphabetic languages. But it caused a significant slowdown for Chinese. Increasing the size to 2 x 256 compensates for the overhead for Chinese and restores the performance. If it is appropriate

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-15 Thread Ezio Melotti
Ezio Melotti added the comment: The Greek sample includes 155 unique characters (including whitespace, punctuation, and the English characters at the beginning), so they can all fit in the cache. The Chinese sample however includes 3695 unique characters (all within the BMP), probably causing

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-15 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Sorry for using the word hieroglyphs for Chinese characters. I didn't know what they are called in English. Limiting the caching to U+31BF has no effect. But increasing the cache size to 2 x 512 slots makes the iteration of Chinese text faster by around

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-15 Thread Terry J. Reedy
Terry J. Reedy added the comment: I looked at the Gutenberg samples. The first has a short intro with some English, then is pure Greek. The patch is clearly good for anyone using mostly a single-block alphabetic language. The second is Chinese, not hieroglyphs (ancient Egyptian). A

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-15 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Yes, of course. There is not much sense in benchmarking a debug build. The tested code is many times slower in debug builds due to complex consistency checking.
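For anyone repeating the timings, it may help to confirm first that the interpreter is not a debug build. A minimal sketch, assuming CPython: `is_debug_build` is a hypothetical helper name, and it relies on `sys.gettotalrefcount` being present only in debug builds.

```python
import sys
import timeit


def is_debug_build() -> bool:
    """Best-effort check: sys.gettotalrefcount exists only in CPython debug builds."""
    return hasattr(sys, "gettotalrefcount")


if __name__ == "__main__":
    if is_debug_build():
        print("Debug build detected; timings will not be representative.")
    # Time a plain per-character loop over a non-Latin1 string.
    text = "中文字符" * 1000
    seconds = timeit.timeit("for c in text: pass", globals={"text": text}, number=100)
    print(f"100 iterations over {len(text)} characters: {seconds:.4f} s")
```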

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-15 Thread Terry J. Reedy
Terry J. Reedy added the comment: Are the timings for normal builds, with, as I understand things, asserts turned off?

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-15 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: keywords: +patch; pull_requests: +3592

[issue31484] Cache single-character strings outside of the Latin1 range

2017-09-15 Thread Serhiy Storchaka
New submission from Serhiy Storchaka: Single-character strings in the Latin1 range (U+0000 - U+00FF) are shared in CPython. This saves memory and CPU time in per-character processing of strings containing ASCII characters and characters from Latin-based alphabets. But the users of languages
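The sharing described in the submission can be observed from Python by comparing object identity. This is a minimal sketch that relies on a CPython implementation detail, not a language guarantee; `same_object` is a hypothetical helper name, and the result for characters outside the Latin1 range may differ between versions and implementations.

```python
def same_object(code_point: int) -> bool:
    """Return True if two chr() calls for this code point yield one shared object."""
    first = chr(code_point)
    second = chr(code_point)
    return first is second


if __name__ == "__main__":
    # 'A' and 'ÿ' are inside the Latin1 range; 'α' and '中' are outside it.
    for code_point in (0x41, 0xFF, 0x03B1, 0x4E2D):
        print(f"U+{code_point:04X}: shared = {same_object(code_point)}")
```

On CPython the Latin1 code points are expected to report a shared object. The proposal in this issue was about extending similar sharing to characters outside that range.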