[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-11 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: There are two causes: 1. UTF-16 and UTF-32 are based on 2- and 4-byte units. If the surrogateescape error handler supported UTF-16 and UTF-32, encoding could produce data that can't be decoded back correctly. For example '\udcac \udcac' ->

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-11 Thread tmp12342
tmp12342 added the comment: Serhiy, I understand the first reason, but https://docs.python.org/3/library/codecs.html says applicable to text encodings: [...] This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-10 Thread Martijn Pieters
Martijn Pieters added the comment: I don't understand why encoding with `surrogateescape` isn't supported still; is it the fact that a surrogate would have to produce *single bytes* rather than double? E.g. b'\x80' -> '\udc80' -> b'\x80' doesn't work because that would mean the UTF-16 and
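A minimal sketch of the b'\x80' -> '\udc80' -> b'\x80' round trip discussed above, assuming Python 3.4 or later. It uses the ASCII codec, where every code unit is one byte; the variable names are illustrative and not taken from the thread. The last step only records what the UTF-16 encoder does with the same handler, without claiming a particular outcome.

```python
# surrogateescape maps each undecodable byte 0x80..0xFF to the lone
# surrogate U+DC80..U+DCFF and maps it back to the same byte on encoding.
raw = b'caf\x80'
text = raw.decode('ascii', 'surrogateescape')
restored = text.encode('ascii', 'surrogateescape')

# UTF-16 works on 2-byte units, so a handler that yields a single byte per
# surrogate has nothing valid to emit. Record the result or the error reason.
try:
    utf16_outcome = '\udc80'.encode('utf-16-le', 'surrogateescape')
except UnicodeEncodeError as exc:
    utf16_outcome = exc.reason

print(repr(text), restored, utf16_outcome)
```

With a one-byte code unit the escaped byte fits exactly into one output unit, which is why the round trip is lossless for codecs such as ASCII or UTF-8.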

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-21 Thread STINNER Victor
STINNER Victor added the comment: Thanks Ezio and Serhiy for having fixed the UTF-16 and UTF-32 codecs! -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue12892 ___

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Roundup Robot
Roundup Robot added the comment: New changeset 0d9624f2ff43 by Serhiy Storchaka in branch 'default': Issue #12892: The utf-16* and utf-32* codecs now reject (lone) surrogates. http://hg.python.org/cpython/rev/0d9624f2ff43 -- nosy: +python-dev
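A short sketch of the behaviour this changeset describes, assuming Python 3.4 or later. The list of codec names and the variable names are chosen for illustration only.

```python
# A lone surrogate is not a Unicode scalar value, so strict encoding fails.
lone = '\ud800'
codecs_to_try = ['utf-16', 'utf-16-le', 'utf-16-be',
                 'utf-32', 'utf-32-le', 'utf-32-be']

rejected = []
for name in codecs_to_try:
    try:
        lone.encode(name)
    except UnicodeEncodeError:
        rejected.append(name)

# A real supplementary character is still encoded as a surrogate pair.
pair_bytes = '\U00010000'.encode('utf-16-le')

print(rejected, pair_bytes)
```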

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Ezio has approved the patch and I have committed it. Thank you Victor and Kang-Hao for your patches. Thanks all for the reviews. -- resolution: -> fixed stage: patch review -> committed/rejected status: open -> closed

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Roundup Robot
Roundup Robot added the comment: New changeset 130597102dac by Serhiy Storchaka in branch 'default': Remove dead code committed in issue #12892. http://hg.python.org/cpython/rev/130597102dac

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-18 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: -- assignee: ezio.melotti -> serhiy.storchaka

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Changed the documentation as was discussed with Ezio on IRC. Ezio, do you want to commit this patch? Feel free to reword the documentation if you feel it could be better. -- Added file: http://bugs.python.org/file32201/utf_16_32_surrogates_6.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: Removed file: http://bugs.python.org/file32201/utf_16_32_surrogates_6.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: Added file: http://bugs.python.org/file32202/utf_16_32_surrogates_6.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-11 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Updated patch addresses Victor's comments on Rietveld. Thank you Victor. The surrogatepass error handler now works with different spellings of encodings (utf_32le, UTF-32-LE, etc). I tested utf_16_32_surrogates_4.patch: surrogateescape with as encoder does
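A small sketch of the surrogatepass handler working across different spellings of the same encoding, assuming Python 3.4 or later. The particular spellings listed here are examples picked for illustration, not the exact set tested in the patch.

```python
# All of these names resolve to the same little-endian UTF-32 codec.
spellings = ['utf-32-le', 'UTF-32-LE', 'utf_32_le', 'UTF_32_LE']

# surrogatepass writes the lone surrogate as an ordinary 4-byte code unit.
encoded = {name: '\udc80'.encode(name, 'surrogatepass') for name in spellings}

# Decoding with the same handler gives the lone surrogate back.
decoded = {name: data.decode(name, 'surrogatepass')
           for name, data in encoded.items()}

print(encoded)
```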

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-10 Thread STINNER Victor
STINNER Victor added the comment: I tested utf_16_32_surrogates_4.patch: surrogateescape with as encoder does not work as expected.
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le',
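The decoding calls quoted above can be reproduced as a script. This is a sketch assuming Python 3.4 or later; the strict and surrogatepass cases are additions for comparison and are not part of the quoted session.

```python
# '[', a lone low surrogate U+DC80, and ']' as UTF-16-LE code units.
data = b'[\x00\x80\xdc]\x00'

ignored = data.decode('utf-16-le', 'ignore')         # drops the bad unit
replaced = data.decode('utf-16-le', 'replace')       # substitutes U+FFFD
passed = data.decode('utf-16-le', 'surrogatepass')   # keeps the surrogate

# Without an error handler the lone surrogate is rejected.
try:
    data.decode('utf-16-le')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), repr(passed), strict_failed)
```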

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-10 Thread STINNER Victor
STINNER Victor added the comment: "Could you please review this not so simple patch instead?" I did a first review of your code on Rietveld.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower. -- nosy: +pitrou

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 10:46, Antoine Pitrou wrote: utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower. It's the default encoding for Unicode text files and APIs on Windows, so I'd say it *is* widely used :-)

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: On 08.10.2013 10:46, Antoine Pitrou wrote: utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower. It's the default encoding for Unicode text files and APIs on Windows, so I'd say it *is* widely used :-) I've never seen any

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 11:03, Antoine Pitrou wrote: utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower. It's the default encoding for Unicode text files and APIs on Windows, so I'd say it *is* widely used :-) I've never seen any

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: MS Notepad and MS Office save Unicode text files in UTF-16-LE, unless you explicitly specify UTF-8, just like many other Windows applications that support Unicode text files: I'd be curious to know if people actually edit *text files* using Microsoft Word

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The UTF-16 codec is still fast enough. Let's first make it correct and then try to optimize it. I have an idea how to restore 3.3 performance (if it is worth it; the code is already complicated enough). Converting to/from wchar_t* uses different code. --

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 11:33, Antoine Pitrou wrote: Antoine Pitrou added the comment: MS Notepad and MS Office save Unicode text files in UTF-16-LE, unless you explicitly specify UTF-8, just like many other Windows applications that support Unicode text

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I repeat myself. Even with the patch, the UTF-16 codec is faster than the UTF-8 codec (except for ASCII-only data). It is the fastest Unicode codec in Python (perhaps UTF-32 can be made faster, but that is another issue). The real question is: Can the UTF-16/32 codecs

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou
Antoine Pitrou added the comment: UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible in Python to not create performance problems when converting between platform Unicode data and the internal formats used in Python. As fast as possible is a platonic dream. They only need to be

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 11:42, Serhiy Storchaka wrote: The UTF-16 codec is still fast enough. Let's first make it correct and then try to optimize it. I have an idea how to restore 3.3 performance (if it is worth it; the code is already complicated enough). That's a good plan

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 08.10.2013 12:30, Antoine Pitrou wrote: UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible in Python to not create performance problems when converting between platform Unicode data and the internal formats used in Python. As fast as

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread STINNER Victor
STINNER Victor added the comment: I don't think that performance on a microbenchmark is the right question. The right question is: does Python conform to Unicode? The answer is simple and explicit: no. Encoding lone surrogates may lead to bugs and even security vulnerabilities. Please open a new

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is my idea: http://permalink.gmane.org/gmane.comp.python.ideas/23521. I see that the discussion about how fast the UTF-16 codec should be is already larger than the discussion about the patches. Could you please review this not so simple patch instead? Yet one help

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Martin v . Löwis
Martin v. Löwis added the comment: Marc-Andre: please don't confuse use in major operating systems with major use in operating systems. I agree with Antoine that UTF-16 isn't widely used on Windows, despite notepad and Office supporting it. Most users on Windows using notepad continue to use

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-07 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Updated whatsnew and Misc/ files. -- Added file: http://bugs.python.org/file31984/utf_16_32_surrogates_4.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-01 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Could you please review this, Ezio?

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is a patch which combines both of Kang-Hao's patches, synchronized with tip, fixed and optimized. Unfortunately, even optimized, this patch slows down encoding/decoding of some data. Here are some benchmark results (benchmarking tools are here:

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: You should be able to squeeze out some extra cycles by avoiding the bit calculations using a simple range check for ch >= 0xd800: +# if STRINGLIB_MAX_CHAR >= 0xd800 +if (((ch1 ^ 0xd800) + (ch1 ^ 0xd800) + (ch1 ^
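To make the two styles of surrogate test concrete, here is a Python sketch comparing a plain range check with a bit-based check. It is an illustration only: the function names are hypothetical, and the shift expression is a common idiom, not the expression used in the C patch quoted above (which is truncated here).

```python
def is_surrogate_range(ch):
    """Range check: surrogates occupy U+D800..U+DFFF."""
    return 0xD800 <= ch <= 0xDFFF


def is_surrogate_bits(ch):
    """Bit check: all surrogates share the top bits 0b11011 (0x1B)."""
    return (ch >> 11) == 0x1B


# Both tests agree on every code point in the Unicode range.
agree = all(is_surrogate_range(c) == is_surrogate_bits(c)
            for c in range(0x110000))
print(agree)
```

Which form is faster in C depends on the compiler and the surrounding loop, as the later messages in this thread show, so the sketch makes no performance claim.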

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Oh, I was blind. Thank you Marc-Andre. Here is the corrected patch. Unfortunately it is 1.4-1.5 times slower on UTF-16 encoding of UCS2 strings than the previous wrong patch. -- Added file: http://bugs.python.org/file31557/utf_16_32_surrogates_3.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: Removed file: http://bugs.python.org/file31555/utf_16_32_surrogates_2.patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: On 02.09.2013 18:56, Serhiy Storchaka wrote: Oh, I was blind. Thank you Marc-Andre. Here is the corrected patch. Unfortunately it is 1.4-1.5 times slower on UTF-16 encoding of UCS2 strings than the previous wrong patch. I think it would be faster to do this in

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: No, it isn't faster. I tested this variant, it is 1.5x slower. And simple range checking actually is slower.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-11-04 Thread Serhiy Storchaka
Changes by Serhiy Storchaka storch...@gmail.com: -- stage: test needed -> patch review versions: +Python 3.4 -Python 3.3

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-04-24 Thread Serhiy Storchaka
Serhiy Storchaka storch...@gmail.com added the comment: * fix an error in the error handler for utf-16-le. (In, Python3.2 b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns \x00 instead of A for some reason) The patch for issue14579 fixes this in Python 3.2. The patch for

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-31 Thread Kang-Hao (Kenny) Lu
Kang-Hao (Kenny) Lu kennyl...@csail.mit.edu added the comment: The following are on my TODO list, although this patch doesn't depend on any of these and can be reviewed and landed separately: * make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-30 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment: Thanks for the patch! * fix an error in the error handler for utf-16-le. (In, Python3.2 b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns \x00 instead of A for some reason) This should probably be done on a separate patch

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-29 Thread Kang-Hao (Kenny) Lu
Kang-Hao (Kenny) Lu kennyl...@csail.mit.edu added the comment: Attached patch does the following beyond what the patch from haypo does: * call the error handler * reject 0xd800~0xdfff when decoding utf-32 The following are on my TODO list, although this patch doesn't depend on any of
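A sketch of what rejecting 0xd800~0xdfff in the UTF-32 decoder looks like from Python code, assuming Python 3.4 or later (where this issue is fixed). Variable names are illustrative.

```python
# The code point U+D800 stored as one little-endian 32-bit unit.
data = (0xD800).to_bytes(4, 'little')

# Strict decoding goes through the error handler and raises.
try:
    data.decode('utf-32-le')
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Other handlers are called for the same 4-byte unit.
replaced = data.decode('utf-32-le', 'replace')
passed = data.decode('utf-32-le', 'surrogatepass')

print(strict_error, repr(replaced), repr(passed))
```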

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: Python 3.3 has a strange behaviour:
>>> '\uDBFF\uDFFF'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'
>>> '\U0010ffff'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'
I would expect text.decode(encoding).encode(encoding)==text
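The asymmetry reported here can still be observed on Python 3.4 or later if the surrogatepass handler is used to get past the now-strict encoder. This is a sketch under that assumption; the variable names are illustrative.

```python
pair = '\uDBFF\uDFFF'      # two separate lone surrogates (length 2)
astral = '\U0010ffff'      # one supplementary character (length 1)

# Written out as 16-bit units, both strings produce identical bytes.
pair_bytes = pair.encode('utf-16-le', 'surrogatepass')
astral_bytes = astral.encode('utf-16-le')

# The decoder sees a valid surrogate pair and returns the single character,
# so the two-surrogate string does not survive the round trip.
decoded = pair_bytes.decode('utf-16-le')

print(len(pair), len(astral), pair_bytes == astral_bytes, decoded == pair)
```

This is the ambiguity that motivates rejecting lone surrogates by default: two different strings would otherwise map to the same byte sequence.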

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: Patch rejecting surrogates in UTF-16 and UTF-32 encoders. I don't think that Python 2.7 and 3.2 should be changed in a minor version. -- dependencies: -Refactor code using unicode_encode_call_errorhandler() in

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: Hum, my patch doesn't call the error handler.

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-10-25 Thread Ezio Melotti
Changes by Ezio Melotti ezio.melo...@gmail.com: -- dependencies: +Refactor code using unicode_encode_call_errorhandler() in unicodeobject.c

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-09-04 Thread Ezio Melotti
New submission from Ezio Melotti ezio.melo...@gmail.com: From Chapter 03 of the Unicode Standard 6[0], D91: • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the
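The D91 definition quoted above can be sketched in a few lines of Python. The excerpt is cut off before the second half of the definition; the surrogate-pair branch below follows the Unicode Standard's rule for scalar values U+10000..U+10FFFF. The function name utf16_code_units is hypothetical and chosen for this illustration.

```python
import struct


def utf16_code_units(scalar):
    """Return the UTF-16 code units for one Unicode scalar value."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        # Surrogate code points are not scalar values and have no encoding.
        raise ValueError('not a Unicode scalar value: %#x' % scalar)
    if scalar <= 0xFFFF:
        # BMP scalar value: one code unit with the same numeric value.
        return [scalar]
    # Supplementary scalar value: split the 20-bit offset into a pair.
    offset = scalar - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


# Cross-check against the built-in codec for one supplementary character.
units = utf16_code_units(0x10FFFF)
codec_units = list(struct.unpack('<2H', '\U0010ffff'.encode('utf-16-le')))
print(units == codec_units)
```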

[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-09-04 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: Ezio Melotti wrote: New submission from Ezio Melotti ezio.melo...@gmail.com: From Chapter 03 of the Unicode Standard 6[0], D91: • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges