[issue41330] Inefficient error-handle for CJK encodings

2020-08-03 Thread STINNER Victor


STINNER Victor  added the comment:

(off topic)

> If nothing happens, I also would like to write a zstd module for stdlib 
> before the end of the year, but I dare not promise this.

I suggest you publish it on PyPI. Once it is mature, you can propose it on 
python-ideas. The last time someone proposed a new compression algorithm for 
the stdlib, it was rejected, if I recall correctly. I forget which one was 
proposed. Maybe search for "compresslib" on python-ideas.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue41330] Inefficient error-handle for CJK encodings

2020-08-03 Thread Ma Lin


Ma Lin  added the comment:

I'm working on issue41265.
If nothing happens, I also would like to write a zstd module for stdlib before 
the end of the year, but I dare not promise this.

If anyone wants to work on this issue, I would be very grateful.

--




[issue41330] Inefficient error-handle for CJK encodings

2020-08-03 Thread STINNER Victor


STINNER Victor  added the comment:

Since the CJK codecs were implemented, unicodeobject.c has received multiple 
optimizations:

* _PyUnicodeWriter for decoder: API designed with efficiency and PEP 393 
(compact string) in mind
* _PyBytesWriter for encoders: in short, API to overallocate a buffer
* _Py_error_handler enum and "_Py_error_handler _Py_GetErrorHandler(const char 
*errors)" function to pass an error handler as an integer rather than a string

But rewriting the CJK codecs with these is a lot of effort, and I'm not sure 
that it's worth it.

--




[issue41330] Inefficient error-handle for CJK encodings

2020-07-31 Thread Ma Lin


Ma Lin  added the comment:

At least this bug should be fixed:

The error-handler object is not cached; it has to be
looked up from a dict every time, which is very inefficient.

The code:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L81-L98
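
As a rough Python-level analogy (the real code is C, in `call_error_callback`), the difference between looking the handler up on every error and looking it up once can be sketched as below. `codecs.lookup_error` is the documented registry lookup; the two helper functions, and the assumption that every character fails to encode, are hypothetical and exist only for illustration.

```python
import codecs


def replacements_uncached(text, encoding, errors):
    """Hypothetical helper: look the handler up again for every failing character."""
    out = []
    for pos in range(len(text)):
        handler = codecs.lookup_error(errors)  # registry lookup on every error
        exc = UnicodeEncodeError(encoding, text, pos, pos + 1, "illustrative error")
        out.append(handler(exc)[0])
    return out


def replacements_cached(text, encoding, errors):
    """Hypothetical helper: look the handler up once and reuse the object."""
    handler = codecs.lookup_error(errors)  # single registry lookup
    out = []
    for pos in range(len(text)):
        exc = UnicodeEncodeError(encoding, text, pos, pos + 1, "illustrative error")
        out.append(handler(exc)[0])
    return out


# Both variants produce the same replacements; only the number of lookups differs.
sample = "\U0001F600\u4e2d"
print(replacements_uncached(sample, "ascii", "xmlcharrefreplace"))
print(replacements_cached(sample, "ascii", "xmlcharrefreplace"))
```

Timing the two helpers with `timeit` on a long string is one way to get a feel for the overhead, although a gap measured in pure Python says little about the size of the effect in the C implementation.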

I will submit a PR at some point.

--




[issue41330] Inefficient error-handle for CJK encodings

2020-07-30 Thread Dong-hee Na


Dong-hee Na  added the comment:

I am also +1 on Serhiy's opinion.

As a Korean (I can't speak for the Japanese or Chinese environments),
I know that some old Korean websites still use the EUC-KR encoding.
But modern Korean websites and applications, at least since the 2010s,
are mostly built on UTF-8.

--
nosy: +corona10




[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

In a Web application you first need to generate the data (this may involve 
some network requests, IO operations, and some data transformations), then 
format the page, then encode it, and finally send it to the client. I suppose 
that the encoding part is minor in comparison with the others.

Also, as Inada-san noted, UTF-8 is the more popular encoding in modern 
applications. It is also fast, so you may prefer UTF-8 if the performance of 
encoding is important to you.

--




[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Ma Lin


Ma Lin  added the comment:

> But how many new Python web applications use a CJK codec instead of UTF-8?

A CJK character usually takes 2 bytes in CJK encodings, but 3 bytes in 
UTF-8.

I tested a Chinese book:
in GBK: 853,025 bytes
in UTF-8: 1,267,523 bytes

For CJK content, UTF-8 is wasteful, so CJK encodings may not disappear.
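
The 2-byte versus 3-byte difference is easy to check on a short string. The sample text below is arbitrary and picked only for illustration; the book figures are the ones quoted above.

```python
sample = "中文编码测试"  # six CJK characters, chosen only for illustration

gbk_size = len(sample.encode("gbk"))     # 2 bytes per character here
utf8_size = len(sample.encode("utf-8"))  # 3 bytes per character here
print(gbk_size, utf8_size)

# Ratio for the book measured above. It comes out slightly below 3/2,
# presumably because ASCII characters take 1 byte in both encodings.
book_ratio = 1_267_523 / 853_025
print(round(book_ratio, 2))
```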

--




[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Inada Naoki


Inada Naoki  added the comment:

But how many new Python web applications use a CJK codec instead of UTF-8?

--
nosy: +inada.naoki




[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Ma Lin


Ma Lin  added the comment:

IMO "xmlcharrefreplace" is useful for Web applications.

For example, if the page's charset is "gbk", this statement generates the 
bytes content easily and safely:

s.encode('gbk', 'xmlcharrefreplace')

Some HTML-related frameworks may escape characters this way, for example 
Sphinx [1].
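
A minimal sketch of what that statement produces, assuming a page that mixes a GBK-encodable character with one that GBK cannot represent (the emoji is just a convenient example of such a character):

```python
import html

page_text = "中\U0001F600"  # U+1F600 has no GBK representation

# Unencodable characters become numeric character references.
body = page_text.encode("gbk", "xmlcharrefreplace")
print(body)

# Roughly what an HTML consumer does: decode with the page charset,
# then resolve the character references.
round_trip = html.unescape(body.decode("gbk"))
print(round_trip == page_text)
```

Note that this only escapes characters the target encoding lacks; it does not escape HTML markup such as `<` or `&`.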


The attached file `error_handers_fast_paths.txt` summarizes all current 
error-handler fast-paths.

[1] Sphinx uses 'xmlcharrefreplace' to escape:
https://github.com/sphinx-doc/sphinx/blob/e65021fb9b0286f373f01dc19a5777e5eed49576/sphinx/builders/html/__init__.py#L1029

--
Added file: https://bugs.python.org/file49324/error_handers_fast_paths.txt




[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I am not even sure it was worth adding a fast path for "xmlcharrefreplace". 
"surrogateescape" and "surrogatepass" are the most likely to be used in 
performance-critical cases. It is also easy to add support for "ignore" and 
"replace". "strict" raises an exception in any case, and "backslashreplace", 
"xmlcharrefreplace" and "namereplace" are too complex and are used in cases 
where coding time is not dominant (error reporting, debugging, formatting 
complex documents).

--
nosy: +serhiy.storchaka




[issue41330] Inefficient error-handle for CJK encodings

2020-07-17 Thread Ma Lin


New submission from Ma Lin:

CJK encode/decode functions only have three error-handler fast-paths:
replace
ignore
strict  
See the code: [1][2]

If another built-in error-handler is used, the codec has to get the 
error-handler object and call it with a Unicode exception argument. See the 
code: [3]
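
At the Python level, the same protocol looks roughly like the sketch below: fetch the handler object by name, build the exception describing the failing span, and call the handler to get a replacement and a resume position. The sample text and the reason string are made up for illustration; `codecs.lookup_error` and the `UnicodeEncodeError` constructor are the documented APIs.

```python
import codecs

text = "中\U0001F600"  # the second character cannot be encoded in GBK

# Step 1: get the error-handler object by name.
handler = codecs.lookup_error("xmlcharrefreplace")

# Step 2: describe the failing span text[1:2] with an exception object.
exc = UnicodeEncodeError("gbk", text, 1, 2, "illegal multibyte sequence")

# Step 3: call the handler; it returns (replacement, position to resume at).
replacement, resume_at = handler(exc)
print(replacement, resume_at)
```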

But the error-handler object is not cached; it has to be looked up from a 
dict every time, which is very inefficient.


Another possible optimization is to write fast-paths for common 
error-handlers. Python has these built-in error-handlers:

strict
replace
ignore
backslashreplace
xmlcharrefreplace
namereplace
surrogateescape
surrogatepass (only for utf-8/utf-16/utf-32 family)
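
For reference, a small sketch of what several of these handlers produce when GBK meets a character it cannot encode. The sample string is arbitrary, and the two surrogate handlers are left out because they only deal with lone surrogates, not with ordinary unencodable characters.

```python
text = "A\U0001F600"  # U+1F600 (GRINNING FACE) is not encodable in GBK

results = {}
for errors in ("replace", "ignore", "backslashreplace",
               "xmlcharrefreplace", "namereplace"):
    results[errors] = text.encode("gbk", errors)

# "strict" raises instead of producing a replacement.
try:
    text.encode("gbk", "strict")
except UnicodeEncodeError as exc:
    results["strict"] = f"UnicodeEncodeError at position {exc.start}"

for name, value in results.items():
    print(f"{name:18} {value!r}")
```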

For example, `xmlcharrefreplace` may be heavily used in Web applications; it 
could be implemented as a fast-path, so that there is no need to call the 
error-handler object every time, just like the `xmlcharrefreplace` fast-path 
in `PyUnicode_EncodeCharmap` [4].
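
The idea of a fast-path can be sketched in pure Python as follows. This is not how `multibytecodec.c` or `PyUnicode_EncodeCharmap` is written; the `ErrorKind` enum and both functions are hypothetical names, the encoder works one character at a time (so it only suits stateless codecs such as GBK), and it assumes the handler always resumes right after the failing character.

```python
import codecs
from enum import Enum, auto


class ErrorKind(Enum):
    """Hypothetical integer-like tags for handlers that get an inline fast-path."""
    STRICT = auto()
    REPLACE = auto()
    IGNORE = auto()
    XMLCHARREFREPLACE = auto()
    OTHER = auto()


_FAST = {
    "strict": ErrorKind.STRICT,
    "replace": ErrorKind.REPLACE,
    "ignore": ErrorKind.IGNORE,
    "xmlcharrefreplace": ErrorKind.XMLCHARREFREPLACE,
}


def encode_with_fast_paths(text, encoding="gbk", errors="strict"):
    """Illustrative encoder: resolve `errors` once, then branch on the tag."""
    kind = _FAST.get(errors, ErrorKind.OTHER)
    # Only the slow path needs the handler object, and it is fetched once.
    callback = codecs.lookup_error(errors) if kind is ErrorKind.OTHER else None
    out = bytearray()
    for pos, ch in enumerate(text):
        try:
            out += ch.encode(encoding)
            continue
        except UnicodeEncodeError:
            pass
        if kind is ErrorKind.STRICT:
            raise UnicodeEncodeError(encoding, text, pos, pos + 1,
                                     "illegal multibyte sequence")
        if kind is ErrorKind.IGNORE:
            continue
        if kind is ErrorKind.REPLACE:
            out += b"?"
        elif kind is ErrorKind.XMLCHARREFREPLACE:
            out += b"&#%d;" % ord(ch)  # inline, no handler call
        else:
            exc = UnicodeEncodeError(encoding, text, pos, pos + 1,
                                     "illegal multibyte sequence")
            replacement, _ = callback(exc)  # slow path: call the handler object
            if isinstance(replacement, str):
                replacement = replacement.encode(encoding)
            out += replacement
    return bytes(out)


print(encode_with_fast_paths("中\U0001F600", "gbk", "xmlcharrefreplace"))
```

For the handlers it covers, the sketch is meant to agree with the built-in codec, for instance `"中\U0001F600".encode("gbk", "xmlcharrefreplace")`.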

[1] encode function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L192

[2] decode function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L347

[3] `call_error_callback` function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L82

[4] `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap`:
https://github.com/python/cpython/blob/v3.9.0b4/Objects/unicodeobject.c#L8662

--
components: Unicode
messages: 373871
nosy: ezio.melotti, malin, vstinner
priority: normal
severity: normal
status: open
title: Inefficient error-handle for CJK encodings
type: performance
versions: Python 3.10
