[issue18059] Add multibyte encoding support to pyexpat

2017-03-27 Thread Walter Dörwald

Walter Dörwald added the comment:

This looks to me like a limited reimplementation of the codec machinery. Why 
not use incremental codecs as a preprocessor? Would this be too slow?
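
A minimal sketch of the preprocessor idea, not taken from any of the patches: an incremental decoder from the codecs module transcodes the input to UTF-8 before it reaches expat. The function name parse_with_preprocessor and the chunked input are assumptions made for illustration. Passing encoding="utf-8" to ParserCreate() overrides the encoding named in the XML declaration.

```python
import codecs
from xml.parsers import expat


def parse_with_preprocessor(chunks, source_encoding):
    """Transcode byte chunks to UTF-8 and feed them to expat.

    Returns the concatenated character data of the document.
    """
    # The incremental decoder buffers partial multibyte sequences that
    # are split across chunk boundaries.
    decoder = codecs.getincrementaldecoder(source_encoding)()
    # The explicit encoding overrides the one in the XML declaration.
    parser = expat.ParserCreate(encoding="utf-8")
    pieces = []
    parser.CharacterDataHandler = pieces.append
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            parser.Parse(text.encode("utf-8"), False)
    # Flush the decoder; this raises if the input ends mid-character.
    tail = decoder.decode(b"", final=True)
    parser.Parse(tail.encode("utf-8"), True)
    return "".join(pieces)
```

Whether the extra decode/encode round trip is too slow in practice would have to be measured; the sketch only shows that the approach needs no encoding tables inside pyexpat.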

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18059] Add multibyte encoding support to pyexpat

2017-03-25 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
versions: +Python 3.7 -Python 3.4




[issue18059] Add multibyte encoding support to pyexpat

2017-03-25 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Marc-Andre, there are at least two issues about supporting East Asian encodings 
(issue13612 and issue15877). I think this means that these encodings are used in 
XML in the wild. The current encoding support (8-bit + UTF-8 + UTF-16) is enough 
for my needs, but I have never had to deal with East Asian languages.

Currently the CodecInfo object has an optional flag _is_text_encoding. I think 
we can add more private attributes (flags and precomputed tables) for use 
with the expat parser. If they are not set (third-party encodings), the current 
autodetection code can be used as a fallback.
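
A rough sketch of how such private attributes could be consulted, assuming a made-up attribute name: _is_text_encoding is the existing private flag, while _expat_table is purely hypothetical and does not exist in CPython. The helper name expat_hints is also invented for illustration.

```python
import codecs


def expat_hints(encoding):
    """Collect private CodecInfo attributes that a parser could consult."""
    info = codecs.lookup(encoding)
    return {
        # Existing private flag; assume True when a codec omits it.
        "is_text": getattr(info, "_is_text_encoding", True),
        # HYPOTHETICAL attribute, not part of CPython: a precomputed
        # table for expat. None means "fall back to autodetection".
        "expat_table": getattr(info, "_expat_table", None),
    }
```

With getattr() and a default, third-party codecs that know nothing about the new attributes keep working through the fallback path.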

--
nosy: +ncoghlan




[issue18059] Add multibyte encoding support to pyexpat

2013-11-22 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

If anybody is interested in support for multibyte encodings in the XML parser, 
now is the time to review the patch.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18059
___



[issue18059] Add multibyte encoding support to pyexpat

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

I'm not sure that multibyte encodings other than UTF-8 are used in the world. 
I'm not convinced that we should support them. If the changes are small, it's 
maybe not a bad thing. Do you know which applications use such codecs?

pyexpat_encoding_create() looks like a heuristic. How many multibyte codecs 
can be used with your patch? A whitelist of multibyte codecs may be less 
reliable. What do you think?

--
nosy: +haypo




[issue18059] Add multibyte encoding support to pyexpat

2013-11-22 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> I'm not sure that multibyte encodings other than UTF-8 are used in the world.

I don't use any of them but I heard some of them are still widely used.

This issue was provoked by issue13612. See also related issue15877.

> pyexpat_encoding_create() looks like an heuristic. How many multibyte codecs 
> can be used with your patch?

All codecs which can be supported by expat.


   1. Every ASCII character that can appear in a well-formed XML document,
  other than the characters

  $@\^`{}~

  must be represented by a single byte, and that byte must be the
  same byte that represents that character in ASCII.

   2. No character may require more than 4 bytes to encode.

   3. All characters encoded must have Unicode scalar values <=
  0xFFFF, (i.e., characters that would be encoded by surrogates in
  UTF-16 are not allowed).  Note that this restriction doesn't
  apply to the built-in support for UTF-8 and UTF-16.

   4. No Unicode character may be encoded by more than one distinct
  sequence of bytes.


14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, 
cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, 
shift-jis-2004, shift-jisx0213.
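
As an illustration only, criterion 1 can be checked from Python with a few lines. This is a sketch under stated assumptions, not the code from the patch: the function name satisfies_criterion_1 is invented, the set of "ASCII characters that can appear in a well-formed XML document" is approximated as tab, LF, CR and 0x20-0x7E, and criteria 2-4 are not checked at all.

```python
# ASCII characters exempt from criterion 1.
EXEMPT = set("$@\\^`{}~")
# Approximation of the ASCII characters allowed in well-formed XML:
# tab, LF, CR and the printable range 0x20-0x7E.
XML_ASCII = [0x09, 0x0A, 0x0D] + list(range(0x20, 0x7F))


def satisfies_criterion_1(encoding):
    """Return True if every non-exempt ASCII byte decodes to itself."""
    for code in XML_ASCII:
        if chr(code) in EXEMPT:
            continue
        try:
            decoded = bytes([code]).decode(encoding)
        except UnicodeDecodeError:
            return False
        if decoded != chr(code):
            return False
    return True
```

An encoding that passes this check may still be rejected by the other criteria, so the result is a necessary condition only.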

> A whitelist of multibyte codecs may be less reliable. What do you think?

pyexpat_multibyte_encodings_4.patch implements this approach. It hardcodes a list of 
supported encodings with the minimal required tables.

pyexpat_multibyte_encodings_5.patch supports any encoding which satisfies the expat 
criteria and builds all needed data at first access (tens of kilobytes). After a 
heavy start it works much faster than the previous patch.

--




[issue18059] Add multibyte encoding support to pyexpat

2013-11-22 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 22.11.2013 23:03, STINNER Victor wrote:
> 
> I'm not sure that multibyte encodings other than UTF-8 are used in the world. 
> I'm not convinced that we should support them. If the changes are small, it's 
> maybe not a bad thing. Do you know which applications use such codecs?

I'm not sure what you mean by multibyte encodings. There's UTF-16, which
is a popular 2-byte encoding, and then there are a whole lot of variable
length encodings such as UTF-8 and many of the Asian codecs in the stdlib.

While you see those used a lot for text, I'm not sure whether the
same is true for XML documents, where UTF-8 is the standard,
but other encodings can be specified if needed.

Serhiy: Apart from this being a nice-to-have feature, where do you see
the practical use?

--




[issue18059] Add multibyte encoding support to pyexpat

2013-09-14 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

1) Expat itself is responsible for this guard. It has all the necessary information 
and provides input of the required size to the custom converter.

2) Yes, this is a problem. I'm working on another approach, in which the full encoding 
table is built at the first request for the encoding and then cached. It makes 
decoding individual characters fast, but requires about 0.5 sec for 
initialization. Is such an approach more suitable?

--




[issue18059] Add multibyte encoding support to pyexpat

2013-09-14 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is a totally rewritten patch, which builds the decoding table at the first request 
for an encoding and saves it in the cache. Decoding should be very fast.
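
The build-once-then-cache idea can be sketched in pure Python. This is only an illustration, assuming a simplified table that covers two-byte sequences; the real patch works in C and its table layout is not shown in this thread. The name two_byte_table is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def two_byte_table(encoding):
    """Map each valid two-byte sequence to its Unicode code point.

    The table is built on the first call for an encoding and served
    from the cache afterwards.
    """
    table = {}
    for lead in range(0x80, 0x100):
        for trail in range(0x100):
            raw = bytes([lead, trail])
            try:
                text = raw.decode(encoding)
            except UnicodeDecodeError:
                continue  # invalid or incomplete sequence
            # Skip pairs that are really two single-byte characters.
            if len(text) == 1:
                table[raw] = ord(text)
    return table
```

The first call pays for tens of thousands of decode attempts; later lookups are plain dictionary accesses, which mirrors the slow-start, fast-decode trade-off described above.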

Do you have large XML test files with multibyte encodings? Could you please 
measure the time of parsing these files and, for comparison, the time of parsing 
the same files encoded with utf-8 and utf-16?
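
For anyone without real test files, a comparison along these lines could be run on synthetic data. The helper names make_document and time_parse, the document shape and the item count are all assumptions; real-world files may behave differently.

```python
import time
from xml.parsers import expat


def make_document(encoding, items=1000):
    """Build a synthetic XML document encoded in the given encoding."""
    body = "".join(
        "<item>\u65e5\u672c\u8a9e %d</item>" % i for i in range(items)
    )
    text = '<?xml version="1.0" encoding="%s"?><root>%s</root>' % (
        encoding, body)
    return text.encode(encoding)


def time_parse(data, repeat=5):
    """Return the best wall-clock time of several full parses."""
    best = float("inf")
    for _ in range(repeat):
        parser = expat.ParserCreate()
        start = time.perf_counter()
        parser.Parse(data, True)
        best = min(best, time.perf_counter() - start)
    return best
```

Calling time_parse(make_document("utf-8")) and time_parse(make_document("utf-16")) gives the baseline numbers; with the patch applied, the same call with a multibyte encoding such as "euc-jp" would supply the figure to compare against.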

--
Added file: http://bugs.python.org/file31758/pyexpat_multibyte_encodings.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-09-14 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Removed file: http://bugs.python.org/file31758/pyexpat_multibyte_encodings.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-09-14 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file31759/pyexpat_multibyte_encodings_5.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-09-13 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +scoder




[issue18059] Add multibyte encoding support to pyexpat

2013-09-13 Thread Eli Bendersky

Changes by Eli Bendersky eli...@gmail.com:


--
nosy:  -eli.bendersky




[issue18059] Add multibyte encoding support to pyexpat

2013-09-13 Thread Stefan Behnel

Stefan Behnel added the comment:

I don't think I have my head deep enough in the encodings implementation to say 
that this is the correct/best way to do it, but the patch looks mostly 
reasonable to me and would be a helpful addition.

I have two comments on the pyexpat_encoding_convert() function:

1) I can't see a safeguard against reading beyond the data buffer. What if s 
already points to the last byte and we are trying to read two or three bytes to 
decode them? I wouldn't be surprised to see that this kind of input can be 
crafted.

2) Creating a throw-away Unicode object through a named decoder looks like a 
huge overhead for decoding two bytes. It might be considered an optimisation to 
change that, but if you are really trying to parse a longer XML document with 
lots of Japanese text in it (i.e. if you actually *need* this feature), it will 
most likely end up being way too slow to make any real use of it.

I think that both points should be addressed before this gets added.

--




[issue18059] Add multibyte encoding support to pyexpat

2013-05-26 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Removed file: 
http://bugs.python.org/file30378/pyexpat_multibyte_encodings_2.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-05-26 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file30380/pyexpat_multibyte_encodings_3.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-05-26 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Patch updated. More tests added and more bugs fixed.

--
Added file: http://bugs.python.org/file30381/pyexpat_multibyte_encodings_4.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-05-26 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Removed file: 
http://bugs.python.org/file30380/pyexpat_multibyte_encodings_3.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

It is possible to add support for most multibyte encodings to pyexpat.

There are several ways to do this:

1. Generate maps with a special script and add the generated file to the repository. 
After adding or updating a multibyte encoding this file should be regenerated.

2. Generate maps on the fly. It requires more time for the first use of an encoding, 
but allows support of any encoding which is compatible with expat.

--
components: Extension Modules, XML
files: expat_encodings.py
messages: 189989
nosy: doerwalter, eli.bendersky, lemburg, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Add multibyte encoding support to pyexpat
type: enhancement
versions: Python 3.4
Added file: http://bugs.python.org/file30368/expat_encodings.py




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc added the comment:

I guess GB18030 can't be supported at all?

--
nosy: +amaury.forgeotdarc




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is a patch which implements the first way.

Yes, it looks like the following encodings cannot be supported at all: euc-kr, 
gb18030, iso2022-kr, utf-7, cp037, cp424, cp500, cp864, cp875, cp1026, cp1140, 
utf_32, utf_32_be, utf_32_le.

--
keywords: +patch
Added file: http://bugs.python.org/file30373/pyexpat_multibyte_encodings.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc added the comment:

Then you should also remove the "Make it as simple as possible" comment :-/

--




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

It is still simple enough.

--




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Patch updated. Fixed an error in the encodings generator and added an additional 
compatibility check for 8-bit encodings in PyUnknownEncodingHandler().

Feel free to bikeshed the encodings generator.

--
Added file: http://bugs.python.org/file30378/pyexpat_multibyte_encodings_2.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Removed file: http://bugs.python.org/file30373/pyexpat_multibyte_encodings.patch




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Removed file: http://bugs.python.org/file30368/expat_encodings.py




[issue18059] Add multibyte encoding support to pyexpat

2013-05-25 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
stage:  -> patch review
