[issue41330] Inefficient error-handle for CJK encodings

2020-08-03 Thread STINNER Victor
STINNER Victor added the comment: (off topic)
> If nothing happens, I also would like to write a zstd module for stdlib
> before the end of the year, but I dare not promise this.
I suggest you publish it on PyPI. Once it is mature, you can propose it on python-ideas. Last time

[issue41330] Inefficient error-handle for CJK encodings

2020-08-03 Thread Ma Lin
Ma Lin added the comment: I'm working on issue41265. If nothing happens, I also would like to write a zstd module for stdlib before the end of the year, but I dare not promise this. If anyone wants to work on this issue, I would be very grateful.

[issue41330] Inefficient error-handle for CJK encodings

2020-08-03 Thread STINNER Victor
STINNER Victor added the comment: Since the CJK codecs were implemented, unicodeobject.c has received multiple optimizations:
* _PyUnicodeWriter for decoders: API designed with efficiency and PEP 393 (compact string) in mind
* _PyBytesWriter for encoders: in short, an API to overallocate a buffer
*

[issue41330] Inefficient error-handle for CJK encodings

2020-07-31 Thread Ma Lin
Ma Lin added the comment: At least fix this bug: the error-handler object is not cached; it has to be looked up from a dict every time, which is very inefficient. The code: https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L81-L98 I will submit a
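To illustrate the difference between looking the handler up on every error and caching it, here is a minimal Python-level sketch. It is only an analogy for the C code linked above: the helper names `handle_uncached` and `handle_cached` are hypothetical, and only `codecs.lookup_error` and `UnicodeEncodeError` are real standard-library names.

```python
import codecs


def handle_uncached(errors, excs):
    """Hypothetical helper: look the handler up again for every error."""
    results = []
    for exc in excs:
        handler = codecs.lookup_error(errors)  # registry lookup each time
        results.append(handler(exc))
    return results


def handle_cached(errors, excs):
    """Hypothetical helper: look the handler up once and reuse it."""
    handler = codecs.lookup_error(errors)  # single registry lookup
    return [handler(exc) for exc in excs]


# Two hand-built encode errors, each pointing at one unencodable character.
excs = [
    UnicodeEncodeError("gbk", "a\U0001F600b", 1, 2, "illegal multibyte sequence"),
    UnicodeEncodeError("gbk", "\U0001F601", 0, 1, "illegal multibyte sequence"),
]

uncached = handle_uncached("xmlcharrefreplace", excs)
cached = handle_cached("xmlcharrefreplace", excs)
```

Both helpers return the same replacements; the cached variant simply avoids repeating the lookup for each error.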

[issue41330] Inefficient error-handle for CJK encodings

2020-07-30 Thread Dong-hee Na
Dong-hee Na added the comment: I am also +1 on Serhiy's opinion. As a Korean (I don't know the situation in Japan or China), I know that there are still old Korean websites that use the EUC-KR encoding. But for modern Korean websites and applications, at least since the 2010s, most are built on

[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: In a Web application you first need to generate the data (this may involve some network requests, IO operations, and some data transformations), then format the page, then encode it, and finally send it to the client. I suppose that the encoding part is minor

[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Ma Lin
Ma Lin added the comment:
> But how many new Python web application use CJK codec instead of UTF-8?
A CJK character usually takes 2 bytes in CJK encodings, but 3 bytes in UTF-8. I tested a Chinese book:
in GBK: 853,025 bytes
in UTF-8: 1,267,523 bytes
For CJK content, UTF-8 is
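The 2-byte versus 3-byte difference can be checked on a small scale with a short sketch. The sample string below is an assumption chosen for illustration, not the book from the comment.

```python
# Four common Chinese characters, all encodable in GBK (illustrative sample).
text = "汉字编码"

gbk_size = len(text.encode("gbk"))     # 2 bytes per character
utf8_size = len(text.encode("utf-8"))  # 3 bytes per character

print(f"GBK: {gbk_size} bytes, UTF-8: {utf8_size} bytes")
```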

[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Inada Naoki
Inada Naoki added the comment: But how many new Python web application use CJK codec instead of UTF-8?
--
nosy: +inada.naoki

[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Ma Lin
Ma Lin added the comment: IMO "xmlcharrefreplace" is useful for Web applications. For example, if the page's charset is "gbk", this statement generates the bytes content easily and safely: s.encode('gbk', 'xmlcharrefreplace') Maybe some HTML-related frameworks use this approach to escape
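A minimal sketch of that statement in action, assuming a sample string that mixes GBK-encodable characters with an emoji that GBK cannot represent:

```python
# "中文" is encodable in GBK; the emoji U+1F600 is not (illustrative sample).
s = "中文 \U0001F600"

# Unencodable characters become numeric character references (&#NNNN;),
# which a browser rendering the GBK page can still display.
encoded = s.encode("gbk", "xmlcharrefreplace")

# Decoding as GBK shows the reference left in place of the emoji.
round_trip = encoded.decode("gbk")
```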

[issue41330] Inefficient error-handle for CJK encodings

2020-07-18 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I am not even sure it was worth adding a fast path for "xmlcharrefreplace". "surrogateescape" and "surrogatepass" are most likely used in performance-critical cases. It is also easy to add support for "ignore" and "replace". "strict" raises an exception in

[issue41330] Inefficient error-handle for CJK encodings

2020-07-17 Thread Ma Lin
New submission from Ma Lin: CJK encode/decode functions only have three error-handler fast paths:
replace
ignore
strict
See the code: [1][2]
If another built-in error handler is used, the codec has to get the error-handler object and call it with a Unicode exception argument. See the code:
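As a rough Python-level illustration of that slow path (a sketch, not the C implementation), an error handler can be fetched by name and called with a hand-built exception. The sample string and reason text are assumptions made for the example.

```python
import codecs

# Fetch a built-in error handler that has no fast path in the CJK codecs.
handler = codecs.lookup_error("backslashreplace")

# Build the exception the handler expects: encoding, input string,
# start and end of the unencodable span, and a reason.
exc = UnicodeEncodeError("gbk", "a\U0001F600b", 1, 2, "illegal multibyte sequence")

# The handler returns the replacement text and the position to resume at.
replacement, resume_pos = handler(exc)
```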