[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

There are two reasons:

1. UTF-16 and UTF-32 are based on 2- and 4-byte units. If the surrogateescape 
error handler supported UTF-16 and UTF-32, encoding could produce data 
that can't be decoded back correctly. For example '\udcac \udcac' -> 
b'\xac\x20\x00\xac' -> '\u20ac\uac00' == '€가'.

2. ASCII bytes (0x00-0x7F) can't be escaped with surrogateescape. UTF-16 and 
UTF-32 data can contain illegal sequences that include ASCII bytes 
(b'\xD8\x00' in UTF-16-BE, b'abcd' in UTF-32). For the same reason 
surrogateescape is not compatible with UTF-7 and CP037.
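
The first point can be reproduced by hand. The sketch below does not use a real surrogateescape-enabled UTF-16 encoder (none exists); `hypothetical_escape_encode` is an illustrative helper, not a CPython API, that emits one raw byte for each escaped character (U+DC80..U+DCFF) and regular UTF-16-LE code units for everything else:

```python
def hypothetical_escape_encode(text):
    """Illustration only: what a UTF-16-LE encoder honouring
    surrogateescape would have to emit (one raw byte per escaped char)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)        # the escaped byte, a single byte
        else:
            out += ch.encode('utf-16-le')  # regular 2-byte code unit(s)
    return bytes(out)

data = hypothetical_escape_encode('\udcac \udcac')
print(data)                               # four bytes, as in the example above
print(ascii(data.decode('utf-16-le')))    # decodes without error, to other text
```

The four bytes form two valid UTF-16-LE code units, so a decoder has no way to tell that they were meant as "escaped byte, space, escaped byte".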

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12892
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-11 Thread tmp12342

tmp12342 added the comment:

Serhiy, I understand the first reason, but 
https://docs.python.org/3/library/codecs.html says
> applicable to text encodings:
> [...]
> This code will then be turned back into the same byte when the 
> 'surrogateescape' error handler is used when encoding the data.
Shouldn't it be corrected? Text encoding is defined as "A codec which encodes 
Unicode strings to bytes."


And about the second one, could you explain a bit more? I mean, I don't know 
how to interpret it.

You say b'\xD8\x00' are invalid ASCII bytes, but of these two only 0xD8 is 
invalid. Also, we are talking about encoding here, str -> bytes, so who cares 
whether the resulting bytes are ASCII-compatible or not?

--
nosy: +tmp12342




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2015-08-10 Thread Martijn Pieters

Martijn Pieters added the comment:

I still don't understand why encoding with `surrogateescape` isn't supported; 
is it the fact that a surrogate would have to produce *single bytes* rather 
than double? E.g. b'\x80' -> '\udc80' -> b'\x80' doesn't work because that 
would mean the UTF-16 and UTF-32 codecs could then end up producing an odd 
number of bytes?

--
nosy: +mjpieters




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-21 Thread STINNER Victor

STINNER Victor added the comment:

Thanks Ezio and Serhiy for having fixed the UTF-16 and UTF-32 codecs!

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 0d9624f2ff43 by Serhiy Storchaka in branch 'default':
Issue #12892: The utf-16* and utf-32* codecs now reject (lone) surrogates.
http://hg.python.org/cpython/rev/0d9624f2ff43

--
nosy: +python-dev




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Ezio has approved the patch and I have committed it.

Thank you Victor and Kang-Hao for your patches. Thanks all for the reviews.

--
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-19 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 130597102dac by Serhiy Storchaka in branch 'default':
Remove dead code committed in issue #12892.
http://hg.python.org/cpython/rev/130597102dac

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-11-18 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
assignee: ezio.melotti -> serhiy.storchaka




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Changed the documentation as was discussed with Ezio on IRC.

Ezio, do you want to commit this patch? Feel free to reword the documentation 
if you feel it could be better.

--
Added file: http://bugs.python.org/file32201/utf_16_32_surrogates_6.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Removed file: http://bugs.python.org/file32201/utf_16_32_surrogates_6.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-18 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file32202/utf_16_32_surrogates_6.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The updated patch addresses Victor's comments on Rietveld. Thank you, Victor. 
The surrogatepass error handler now works with different spellings of encodings 
(utf_32le, UTF-32-LE, etc).
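
As a rough check of that behaviour on a current interpreter (a sketch, assuming Python 3.4 or later; the variable names are only illustrative), a lone surrogate can be pushed through surrogatepass under a few spellings of the codec name:

```python
lone = '\ud800'   # a lone high surrogate

results = {}
# Several spellings of the same codec name.
for name in ('utf-16-le', 'UTF-16-LE', 'utf_16_le'):
    raw = lone.encode(name, 'surrogatepass')
    back = raw.decode(name, 'surrogatepass')
    results[name] = (raw, back == lone)
    print(name, raw, back == lone)

raw32 = lone.encode('utf-32-be', 'surrogatepass')
print(raw32, raw32.decode('utf-32-be', 'surrogatepass') == lone)
```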

> I tested utf_16_32_surrogates_4.patch: surrogateescape with a decoder does 
> not work as expected.

Yes, surrogateescape doesn't work with ASCII-incompatible encodings, and it 
can't. First, it can't represent the result of decoding b'\x00\xd8' from 
utf-16-le or b'ABCD' from utf-32*. This problem is worth a separate issue (or 
even a PEP) and a discussion on Python-Dev.
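
The two inputs named above can be tried directly. This is a small sketch, assuming Python 3.4 or later; `escape_decode` is a hypothetical helper that returns None when the decoder gives up:

```python
def escape_decode(data, encoding):
    """Decode with surrogateescape; return None if the decoder still fails."""
    try:
        return data.decode(encoding, 'surrogateescape')
    except UnicodeDecodeError as exc:
        print(encoding, 'failed:', exc.reason)
        return None

# The offending sequences start with a byte below 0x80, which the
# surrogateescape handler refuses to escape, so the error propagates.
utf16_result = escape_decode(b'\x00\xd8', 'utf-16-le')
utf32_result = escape_decode(b'ABCD', 'utf-32-le')

# For comparison: a non-ASCII byte in an ASCII-compatible codec is escaped.
ascii_result = escape_decode(b'\xff', 'ascii')
print(ascii(ascii_result))
```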

--
Added file: http://bugs.python.org/file32047/utf_16_32_surrogates_5.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-10 Thread STINNER Victor

STINNER Victor added the comment:

I tested utf_16_32_surrogates_4.patch: surrogateescape with a decoder does not 
work as expected.

>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\u'

=> I expected '[\udc80\udcdc]'.

With an encoder, surrogateescape does not work either:

>>> '[\uDC80]'.encode('utf-16-le', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\udc80' in 
position 1: surrogates not allowed

According to PEP 383, I expect that data.decode(encoding, 'surrogateescape') 
never fails, and that data.decode(encoding, 
'surrogateescape').encode(encoding, 'surrogateescape') gives back data.
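
For ASCII-compatible codecs this round-trip property is easy to check. The sketch below is illustrative only: `roundtrips` is a hypothetical helper, and the sample data is arbitrary.

```python
import os

def roundtrips(data, encoding):
    """True if decode + encode with surrogateescape restores the bytes."""
    text = data.decode(encoding, 'surrogateescape')
    return text.encode(encoding, 'surrogateescape') == data

# Every single byte value, plus some random byte strings.
samples = [bytes([b]) for b in range(256)]
samples += [os.urandom(32) for _ in range(100)]

utf8_ok = all(roundtrips(s, 'utf-8') for s in samples)
ascii_ok = all(roundtrips(s, 'ascii') for s in samples)
print(utf8_ok, ascii_ok)
```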

--

With UTF-16, there is a corner case:

>>> b'[\x00\x00'.decode('utf-16-le', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/haypo/prog/python/default/Lib/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 2: 
truncated data
>>> b'[\x00\x80'.decode('utf-16-le', 'surrogateescape')
'[\udc80'

The incomplete sequence b'\x00' raises a decoder error, whereas b'\x80' does 
not. Should we extend PEP 383 to bytes in the range [0; 127]? Or should we keep 
this behaviour?

Sorry, this question is unrelated to this issue.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-10 Thread STINNER Victor

STINNER Victor added the comment:

> Could you please review this not so simple patch instead?

I did a first review of your code on rietveld.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou

Antoine Pitrou added the comment:

utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower.

--
nosy: +pitrou




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 08.10.2013 10:46, Antoine Pitrou wrote:
> 
> utf-16 isn't that widely used, so it's probably fine if it becomes a bit 
> slower.

It's the default encoding for Unicode text files and APIs on Windows,
so I'd say it *is* widely used :-)

http://en.wikipedia.org/wiki/UTF-16#Use_in_major_operating_systems_and_environments

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> On 08.10.2013 10:46, Antoine Pitrou wrote:
> > 
> > utf-16 isn't that widely used, so it's probably fine if it becomes
> > a bit slower.
> 
> It's the default encoding for Unicode text files and APIs on Windows,
> so I'd say it *is* widely used :-)

I've never seen any UTF-16 text files. Do you have other data?

APIs are irrelevant. You only pass very small strings to them (e.g.
file paths).

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 08.10.2013 11:03, Antoine Pitrou wrote:
> 
>>> utf-16 isn't that widely used, so it's probably fine if it becomes
>>> a bit slower.
>>
>> It's the default encoding for Unicode text files and APIs on Windows,
>> so I'd say it *is* widely used :-)
> 
> I've never seen any UTF-16 text files. Do you have other data?

See the link I posted.

MS Notepad and MS Office save Unicode text files in UTF-16-LE,
unless you explicitly specify UTF-8, just like many other Windows
applications that support Unicode text files:

http://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx
http://superuser.com/questions/294219/what-are-the-differences-between-linux-and-windows-txt-files-unicode-encoding

This is simply due to the fact that MS introduced Unicode plain
text files as UTF-16-LE files and only later added the possibility
to also use UTF-8 with BOM versions.

> APIs are irrelevant. You only pass very small strings to them (e.g.
> file paths).

You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
is all around you when working on Windows, not only in the OS APIs,
but also in most other Unicode APIs you find on Windows:

http://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061%28v=vs.85%29.aspx

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> MS Notepad and MS Office save Unicode text files in UTF-16-LE,
> unless you explicitly specify UTF-8, just like many other Windows
> applications that support Unicode text files:

I'd be curious to know if people actually edit *text files* using
Microsoft Word (rather than Word documents).
Same for Notepad, which is much too poor to edit something else
than a 10-line configuration file.

> You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
> is all around you when working on Windows, not only in the OS APIs,
> but also in most other Unicode APIs you find on Windows:

Still, unless those APIs get passed rather large strings, the performance
difference should be irrelevant IMHO. We're talking about using those APIs
from Python, not from a raw optimized C program.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The UTF-16 codec is still fast enough. Let's first make it correct and then try 
to optimize it. I have an idea how to restore 3.3 performance (if it is worth 
it; the code is already complicated enough).

The conversion to/from wchar_t* uses different code.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 08.10.2013 11:33, Antoine Pitrou wrote:
> 
> Antoine Pitrou added the comment:
> 
>> MS Notepad and MS Office save Unicode text files in UTF-16-LE,
>> unless you explicitly specify UTF-8, just like many other Windows
>> applications that support Unicode text files:
> 
> I'd be curious to know if people actually edit *text files* using
> Microsoft Word (rather than Word documents).
> Same for Notepad, which is much too poor to edit something else
> than a 10-line configuration file.

The question is not so much which program they use for editing.
The format "Unicode text file" is defined as UTF-16-LE on
Windows (see the links I posted).

>> You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
>> is all around you when working on Windows, not only in the OS APIs,
>> but also in most other Unicode APIs you find on Windows:
> 
> Still, unless those APIs get passed rather large strings, the performance
> difference should be irrelevant IMHO. We're talking about using those APIs
> from Python, not from a raw optimized C program.

Antoine, I'm just pointing out that your statement that UTF-16
is not widely used may apply to the Unix world, but
it doesn't apply to Windows. Java also uses UTF-16
internally and makes this available via JNI as jchar*.

The APIs on those platforms are used from Python (the interpreter
and also by extensions) and do use the UTF-16 Python codec to
convert the data to Python Unicode/string objects, so the fact
that UTF-16 is used widely on some of the more popular
platforms does matter.

UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible
in Python to not create performance problems when converting
between platform Unicode data and the internal formats
used in Python.

The real question is: Can the UTF-16/32 codecs be made fast
while still detecting lone surrogates ? Not whether UTF-16
is widely used or not.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I repeat myself. Even with the patch, the UTF-16 codec is faster than the UTF-8 
codec (except for ASCII-only data). It is the fastest Unicode codec in Python 
(perhaps UTF-32 can be made faster, but that is another issue).

> The real question is: Can the UTF-16/32 codecs be made fast
> while still detecting lone surrogates ? Not whether UTF-16
> is widely used or not.

Yes, they can. But let's defer this to other issues.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible
> in Python to not create performance problems when converting
> between platform Unicode data and the internal formats
> used in Python.

As fast as possible is a platonic dream.
They only need to be fast enough not to be bottlenecks.
If you know of a *Python* workload where UTF-16 decoding is the
bottleneck, I'd like to know about it :-)

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 08.10.2013 11:42, Serhiy Storchaka wrote:
> 
> The UTF-16 codec is still fast enough. Let's first make it correct and then 
> try to optimize it. I have an idea how to restore 3.3 performance (if it is 
> worth it; the code is already complicated enough).

That's a good plan :-)

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 08.10.2013 12:30, Antoine Pitrou wrote:
> 
>> UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible
>> in Python to not create performance problems when converting
>> between platform Unicode data and the internal formats
>> used in Python.
> 
> As fast as possible is a platonic dream.
> They only need to be fast enough not to be bottlenecks.

No, they need to be as fast as possible, without sacrificing
correctness.

This has always been our guideline for codec implementations
and string methods. As a result, our implementations are some
of the best out there.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread STINNER Victor

STINNER Victor added the comment:

I don't think that performance on a microbenchmark is the right question.
The right question is: does Python conform to Unicode? The answer is simple
and explicit: no. Encoding lone surrogates may lead to bugs and even
security vulnerabilities.

Please open a new performance issue after fixing this one if you have
another patch improving performance.

I didn't read the patch yet, but the strict, surrogatepass and surrogateescape
error handlers must be checked.
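
A quick way to check the strict handler on a current interpreter (a sketch, assuming Python 3.4 or later; the codec list and variable names are only illustrative):

```python
CODECS = ('utf-8', 'utf-16', 'utf-16-le', 'utf-16-be',
          'utf-32', 'utf-32-le', 'utf-32-be')

lone = '\udc80'   # a lone low surrogate
rejected = []
for codec in CODECS:
    try:
        lone.encode(codec)            # errors='strict' is the default
    except UnicodeEncodeError as exc:
        rejected.append(codec)
        print(codec, '->', exc.reason)
    else:
        print(codec, '-> accepted')

# The decoders are strict as well: these bytes spell a lone surrogate.
decode_rejected = []
for raw, codec in ((b'\x80\xdc', 'utf-16-le'), (b'\x80\xdc\x00\x00', 'utf-32-le')):
    try:
        raw.decode(codec)
    except UnicodeDecodeError:
        decode_rejected.append(codec)
print(decode_rejected)
```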

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is my idea: http://permalink.gmane.org/gmane.comp.python.ideas/23521.

I see that the discussion about how fast the UTF-16 codec should be is already 
longer than the discussion about the patches. Could you please review this 
not-so-simple patch instead?

One more thing I need help with is writing a note in the "Porting to Python 
3.4" section in Doc/whatsnew/3.4.rst.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-08 Thread Martin v . Löwis

Martin v. Löwis added the comment:

Marc-Andre: please don't confuse "use in major operating systems" with "major 
use in operating systems".  I agree with Antoine that UTF-16 isn't widely used 
on Windows, despite notepad and Office supporting it. Most users on Windows 
using notepad continue to use the ANSI code page, most users of Word use Word 
files (instead of plain text).

Also, wchar_t on Windows isn't *really* UTF-16. Many APIs support lone 
surrogates just fine; they really are UCS-2 instead (e.g. the file system 
APIs). Only starting with Vista, MultiByteToWideChar will complain about lone 
surrogates.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-07 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Updated whatsnew and Misc/ files.

--
Added file: http://bugs.python.org/file31984/utf_16_32_surrogates_4.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-10-01 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Could you please do a review, Ezio?

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is a patch which combines both of Kang-Hao's patches, synchronized with 
tip, fixed and optimized.

Unfortunately, even optimized, this patch slows down encoding/decoding of some 
data. Here are some benchmark results (benchmarking tools are here: 
https://bitbucket.org/storchaka/cpython-stuff/src/default/bench).

3.3           3.4           3.4
              unpatched     patched

969 (+12%)    978 (+11%)    1087    encode  utf-16le  'A'*1
2453 (-62%)   2356 (-61%)   923     encode  utf-16le  '\u0100'*1
222 (+12%)    224 (+11%)    249     encode  utf-16le  '\U0001'+'\u0100'*
784 (+6%)     791 (+5%)     831     encode  utf-16be  'A'*1
750 (-4%)     752 (-4%)     719     encode  utf-16be  '\u0100'*1
233 (+2%)     235 (+1%)     238     encode  utf-16be  '\U0001'+'\u0100'*

531 (-7%)     545 (-9%)     494     encode  utf-32le  'A'*1
383 (-38%)    385 (-38%)    239     encode  utf-32le  '\u0100'*1
324 (-24%)    325 (-25%)    245     encode  utf-32le  '\U0001'+'\u0100'*
544 (-10%)    545 (-10%)    492     encode  utf-32be  'A'*1
384 (-38%)    384 (-38%)    239     encode  utf-32be  '\u0100'*1
325 (-25%)    325 (-25%)    245     encode  utf-32be  '\U0001'+'\u0100'*

682 (+5%)     679 (+5%)     715     decode  utf-16le  'A'*1
607 (+1%)     610 (+1%)     614     decode  utf-16le  '\u0100'*1
550 (+1%)     554 (+0%)     556     decode  utf-16le  '\U0001'+'\u0100'*
609 (+0%)     600 (+2%)     610     decode  utf-16be  'A'*1
464 (+1%)     466 (+1%)     470     decode  utf-16be  '\u0100'*1
432 (+1%)     431 (+1%)     435     decode  utf-16be  '\U0001'+'\u0100'*

103 (+272%)   374 (+2%)     383     decode  utf-32le  'A'*1
91 (+264%)    390 (-15%)    331     decode  utf-32le  '\u0100'*1
90 (+257%)    393 (-18%)    321     decode  utf-32le  '\U0001'+'\u0100'*
103 (+269%)   393 (-3%)     380     decode  utf-32be  'A'*1
91 (+263%)    406 (-19%)    330     decode  utf-32be  '\u0100'*1
90 (+257%)    393 (-18%)    321     decode  utf-32be  '\U0001'+'\u0100'*

Performance of utf-16 decoding is unchanged. The utf-16 encoder is 2.5 times 
slower for UCS2 data (it was just a memcpy) but still 3+ times faster than in 
2.7 and 3.2 (issue15026). Due to additional optimization it is now even 
slightly faster for some other data. There is a patch to speed up UTF-32 
encoding (issue15027); it should help to compensate for its performance 
degradation. The UTF-32 decoder is already 3-4 times faster than in 3.3 
(issue14625).

I don't see any performance objection against this patch.

--
Added file: http://bugs.python.org/file31555/utf_16_32_surrogates_2.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

You should be able to squeeze out some extra cycles by
avoiding the bit calculations using a simple range check
for ch >= 0xd800:

+# if STRINGLIB_MAX_CHAR >= 0xd800
+if (((ch1 ^ 0xd800) &
+ (ch1 ^ 0xd800) &
+ (ch1 ^ 0xd800) &
+ (ch1 ^ 0xd800) & 0xf800) == 0)
+break;
+# endif
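
For a single 16-bit code unit, the XOR-and-mask expression used in the patch is equivalent to a plain range test for the surrogate block. The sketch below checks this exhaustively; the function names are illustrative and not taken from the patch:

```python
def is_surrogate_bits(ch):
    """Bit test: the top five bits of a 16-bit unit equal 11011."""
    return ((ch ^ 0xD800) & 0xF800) == 0

def is_surrogate_range(ch):
    """Range test for the surrogate block U+D800..U+DFFF."""
    return 0xD800 <= ch <= 0xDFFF

# Compare both tests on every possible 16-bit code unit.
same = all(is_surrogate_bits(ch) == is_surrogate_range(ch)
           for ch in range(0x10000))
print(same)
```

The mask only looks at the low 16 bits, so the equivalence is claimed for UCS2 code units only.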

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Oh, I was blind. Thank you, Marc-Andre. Here is the corrected patch. 
Unfortunately it is 1.4-1.5 times slower on UTF-16 encoding of UCS2 strings 
than the previous wrong patch.

--
Added file: http://bugs.python.org/file31557/utf_16_32_surrogates_3.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Removed file: http://bugs.python.org/file31555/utf_16_32_surrogates_2.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 02.09.2013 18:56, Serhiy Storchaka wrote:
> 
> Oh, I was blind. Thank you, Marc-Andre. Here is the corrected patch. 
> Unfortunately it is 1.4-1.5 times slower on UTF-16 encoding of UCS2 strings 
> than the previous wrong patch.

I think it would be faster to do this in two steps:

1. check the ranges of the input

2. do a memcpy() if there are no lone surrogates

Part 1 can be further optimized by adding a simple range
check (ch >= 0xd800).
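
The two steps can be sketched in Python for BMP-only text. This is an illustration of the control flow only, not the C code under review, and it says nothing about relative speed; `encode_ucs2_le` is a hypothetical helper name:

```python
import sys
from array import array

def encode_ucs2_le(text):
    """Encode BMP-only text as UTF-16-LE: validate first, then bulk-copy."""
    # 'H' holds 16-bit units; code points above U+FFFF raise OverflowError.
    units = array('H', map(ord, text))
    # Step 1: range check. The cheap test (max >= 0xD800) skips the
    # per-unit scan for text that cannot contain surrogates.
    if units and max(units) >= 0xD800:
        for pos, unit in enumerate(units):
            if 0xD800 <= unit <= 0xDFFF:
                raise ValueError('lone surrogate at position %d' % pos)
    # Step 2: bulk copy, the Python analogue of memcpy().
    if sys.byteorder == 'big':
        units.byteswap()
    return units.tobytes()

print(encode_ucs2_le('A\u0100'))
```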

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2013-09-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

No, it isn't faster. I tested this variant, it is 1.5x slower.

And simple range checking actually is slower.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-11-04 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
stage: test needed -> patch review
versions: +Python 3.4 -Python 3.3




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-04-24 Thread Serhiy Storchaka

Serhiy Storchaka storch...@gmail.com added the comment:

>  * fix an error in the error handler for utf-16-le. (In, Python3.2 
> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of 
> "A" for some reason)

The patch for issue14579 fixes this in Python 3.2.

The patch for issue14624 fixes this in Python 3.3.

--
nosy: +storchaka




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-31 Thread Kang-Hao (Kenny) Lu

Kang-Hao (Kenny) Lu kennyl...@csail.mit.edu added the comment:

> The following are on my TODO list, although this patch doesn't depend
> on any of these and can be reviewed and landed separately:
>  * make the surrogatepass error handler work for utf-16 and utf-32. (I
>    should be able to finish this by today)

Unfortunately this took longer than I thought but here comes the patch.

>>  * fix an error in the error handler for utf-16-le. (In, Python3.2 
>> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" 
>> instead of "A" for some reason)
>
> This should probably be done on a separate patch that will be applied
> to 3.2/3.3 (assuming that it can go to 3.2).  Rejecting surrogates will
> go in 3.3 only.  (Note that a lot of Unicode-related code changed between
> 3.2 and 3.3.)

This turns out to be just two liners so I fixed that on the way. I can create 
separate patch with separate test for 3.2 (certainly doable) and even for 3.3, 
but since the test is now part of test_lone_surrogates, I feel less willing to 
do that for 3.3.

You might notice the codec naming inconsistency (utf-16-be and utf16be for 
encoding and decoding respectively). I have filed issue #13913 for this.

Also, the strcmps are quite crappy. I am working on issue #13916 (disallow the 
surrogatepass handler for non-utf-* encodings). Once we have that, we can 
examine individual characters instead...

In this patch, the encoding attribute of UnicodeDecodeError is now changed to 
return utf16(be|le) for utf-16. This is necessary information for 
surrogatepass to work, although admittedly it is rather ugly. Any better ideas? 
A new attribute on Unicode(Decode|Encode)Error might be helpful, but 
utf-16/32 are fairly uncommon encodings anyway and we should not add more 
burden for, say, utf-8.
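
To make the role of the encoding attribute concrete, here is a rough sketch of an error handler that inspects the exception it receives. The handler name 'example-question-mark' and the function are hypothetical, made up for this illustration; only codecs.register_error and the exception attributes are real API. The exact string reported in exc.encoding is deliberately not assumed here.

```python
import codecs

seen = []  # (encoding, start, end) for each error the handler receives

def question_mark(exc):
    # Hypothetical handler: replace each unencodable run with '?'.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    seen.append((exc.encoding, exc.start, exc.end))
    # Return the replacement text and the position at which to resume.
    return ('?', exc.end)

codecs.register_error('example-question-mark', question_mark)

encoded = 'a\udc80b'.encode('utf-16-le', 'example-question-mark')
print(encoded, seen)
```

A handler like surrogatepass has to produce bytes itself, which is why it needs to learn from the exception which codec and byte order it is working for.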

>> Should we really reject lone surrogates for UTF-7?
>
> No, I meant only UTF-8/16/32; UTF-7 is fine as is.

Good to know.

--
Added file: http://bugs.python.org/file24384/surrogatepass_for_utf-1632.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-30 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Thanks for the patch!

> * fix an error in the error handler for utf-16-le. (In Python 3.2,
>   b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns '\x00'
>   instead of 'A' for some reason)

This should probably be done in a separate patch that will be applied to 
3.2/3.3 (assuming that it can go to 3.2).  Rejecting surrogates will go in 3.3 
only.  (Note that a lot of Unicode-related code changed between 3.2 and 3.3.)

> Should we really reject lone surrogates for UTF-7?

No, I meant only UTF-8/16/32; UTF-7 is fine as is.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2012-01-29 Thread Kang-Hao (Kenny) Lu

Kang-Hao (Kenny) Lu kennyl...@csail.mit.edu added the comment:

Attached patch does the following beyond what the patch from haypo does:
  * call the error handler
  * reject 0xd800~0xdfff when decoding utf-32

The following are on my TODO list, although this patch doesn't depend on any 
of these and can be reviewed and landed separately:
  * make the surrogatepass error handler work for utf-16 and utf-32. (I should 
be able to finish this by today)
  * fix an error in the error handler for utf-16-le. (In Python 3.2, 
b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns '\x00' instead of 
'A' for some reason)
  * make unicode_encode_call_errorhandler return bytes so that we can simplify 
this patch. (This arguably belongs in a separate bug, so I'll file it when 
needed)

> All UTF codecs should reject lone surrogates in strict error mode,

Should we really reject lone surrogates for UTF-7? There's a test in 
test_codecs.py that checks that '\udc80' is encoded as b'+3IA-'. Given that 
UTF-7 is not really part of the Unicode Standard and is more like a data 
encoding than a text encoding to me, I am not sure it's a good idea.
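
As a sanity check on that test value, the bytes can be rebuilt by hand. This is only a sketch: it assumes the usual UTF-7 scheme of '+', then unpadded base64 over the UTF-16-BE code units, then '-', and that the utf-7 codec still accepts lone surrogates.

```python
import base64

lone = '\udc80'

# UTF-7 shifts into modified base64 over the UTF-16-BE code units.
raw = ord(lone).to_bytes(2, 'big')           # the single code unit 0xDC80
b64 = base64.b64encode(raw).rstrip(b'=')     # UTF-7 drops the '=' padding
by_hand = b'+' + b64 + b'-'

# What the codec itself produces for the same lone surrogate.
via_codec = lone.encode('utf-7')
print(by_hand, via_codec)
```

Nothing in this scheme looks at whether a code unit is a surrogate, which fits the "data encoding" view of UTF-7.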

> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.

For 'replace', the patch now emits b'\x00?' instead of b'?' so that the UTF-16 
stream doesn't get corrupted. This is unusual and does not match

  # Implements the ``replace`` error handling: malformed data is replaced
  # with a suitable replacement character such as ``'?'`` in bytestrings 
  # and ``'\ufffd'`` in Unicode strings.

in the documentation. What do we do? Are there other encodings that are not 
ASCII compatible besides UTF-7, UTF-16 and UTF-32 that Python supports? I think 
it would be better to use encoded U+fffd whenever possible and fall back to 
'?'. What do you think?
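
For comparison, here is a small sketch of what 'replace' does with a lone surrogate across the UTF codecs. It assumes a CPython 3 release where the encoders reject surrogates; it prints the results rather than asserting a particular byte layout.

```python
results = {}
for enc in ('utf-8', 'utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be'):
    # The lone surrogate in the middle is what triggers the error handler.
    results[enc] = '[\udc80]'.encode(enc, 'replace')
    print(enc, results[enc])
```

The point to look for is whether the replacement is a '?' encoded in the target encoding (two or four bytes wide), so that the code-unit alignment of the stream is preserved.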

Some other comments on my own patch:
  * In the STORECHAR macro for utf-16 and utf-32, I changed all instances of 
"ch & 0xFF" to "(unsigned char) ch". I don't have enough C knowledge to know 
if this is actually better or if it makes any difference at all.
  * The code for utf-16 and utf-32 duplicates the utf-8 code, whose 
complexity comes from issue #8092. I'm not sure whether there are ways to 
simplify these. For example, are there suitable functions so that we don't 
need to check for integer overflow at these places?

--
nosy: +kennyluck
Added file: http://bugs.python.org/file24368/utf-1632_reject_surrogates.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Python 3.3 has a strange behaviour:

>>> '\uDBFF\uDFFF'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'
>>> '\U0010ffff'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'

I would expect text.encode(encoding).decode(encoding) == text, or an encode or 
decode error.

So I agree that the encoder should reject lone surrogates.
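
The expectation above can be written as a small helper. This is a sketch only; round_trips is a hypothetical name, and the behaviour noted afterwards assumes a CPython 3 release in which the utf-16 encoder rejects surrogates and supports surrogatepass.

```python
def round_trips(text, encoding, errors='strict'):
    """True if text survives encode + decode unchanged.

    False if the result differs or the codec refuses the input.
    """
    try:
        return text.encode(encoding, errors).decode(encoding, errors) == text
    except UnicodeError:
        return False

print(round_trips('\U0010ffff', 'utf-16-le'))
print(round_trips('\udc80', 'utf-16-le'))
print(round_trips('\udc80', 'utf-16-le', 'surrogatepass'))
```

With a fixed encoder, the non-BMP character round-trips, the lone surrogate is refused in strict mode, and surrogatepass lets the lone surrogate through and back.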

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Patch rejecting surrogates in UTF-16 and UTF-32 encoders.

I don't think that Python 2.7 and 3.2 should be changed in a minor version.

--
dependencies:  -Refactor code using unicode_encode_call_errorhandler() in 
unicodeobject.c
keywords: +patch
Added file: http://bugs.python.org/file23810/utf_16_32_surrogates.patch




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-11-29 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Hum, my patch doesn't call the error handler.

--




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-10-25 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
dependencies: +Refactor code using unicode_encode_call_errorhandler() in 
unicodeobject.c




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-09-04 Thread Ezio Melotti

New submission from Ezio Melotti ezio.melo...@gmail.com:

From Chapter 03 of the Unicode Standard 6[0], D91:

• UTF-16 encoding form: The Unicode encoding form that assigns each Unicode 
scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single 
unsigned 16-bit code unit with the same numeric value as the Unicode scalar 
value, and that assigns each Unicode scalar value in the range 
U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
• Because surrogate code points are not Unicode scalar values, isolated UTF-16 
code units in the range 0xD800..0xDFFF are ill-formed.

I.e. UTF-16 should be able to decode correctly a valid surrogate pair, and 
encode a non-BMP character using a valid surrogate pair, but it should reject 
lone surrogates both during encoding and decoding.

On Python 3, the utf-16 codec can encode all the codepoints from U+0000 to 
U+10FFFF (including (lone) surrogates), but it's not able to decode lone 
surrogates (not sure if this is by design or if it just fails because it 
expects another (missing) surrogate).
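
The surrogate pair mapping referred to above can be sketched in a few lines. The function names are hypothetical and the arithmetic is the standard UTF-16 one (subtract 0x10000, split into two 10-bit halves); it is not taken from the patch.

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('not a supplementary code point: %#x' % cp)
    offset = cp - 0x10000                      # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogate_pair(high, low):
    """Combine a valid (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

hi, lo = to_surrogate_pair(0x1F600)
units = hi.to_bytes(2, 'big') + lo.to_bytes(2, 'big')
print(hex(hi), hex(lo), units)
```

A lone surrogate is simply a code unit in 0xD800..0xDFFF that fails the pairing check in from_surrogate_pair, which is why a conforming decoder has nothing valid to map it to.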

--

From Chapter 03 of the Unicode Standard 6[0], D90:

• UTF-32 encoding form: The Unicode encoding form that assigns each Unicode 
scalar value to a single unsigned 32-bit code unit with the same numeric value 
as the Unicode scalar value.
• Because surrogate code points are not included in the set of Unicode scalar 
values, UTF-32 code units in the range 0xD800..0xDFFF are ill-formed.

I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs, both 
during encoding and during decoding.

On Python 3, the utf-32 codec can encode and decode all the codepoints from 
U+0000 to U+10FFFF (including surrogates).
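
The UTF-32 rule quoted above boils down to a scalar-value check. Here is a minimal sketch with hypothetical helper names; the comparison against the codec assumes a CPython 3 release in which the utf-32 encoder rejects surrogates.

```python
def is_scalar_value(cp):
    """True for code points that are Unicode scalar values (no surrogates)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def utf32_encodes(cp):
    """True if the strict utf-32-le encoder accepts chr(cp)."""
    try:
        chr(cp).encode('utf-32-le')
    except UnicodeEncodeError:
        return False
    return True

samples = [0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF]
for cp in samples:
    print(f'U+{cp:04X} scalar={is_scalar_value(cp)} encodes={utf32_encodes(cp)}')
```

On an interpreter with the behaviour described in this report (surrogates accepted), the two columns would disagree for U+D800 and U+DFFF.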

--

I think that:
  * this should be fixed in 3.3;
  * it's a bug, so the fix /might/ be backported to 3.2.  However, it's also a 
fairly big change in behavior, so it might be better to leave it for 3.3 only;
  * it's better to leave 2.7 alone, since even the utf-8 codec is broken there;
  * the surrogatepass error handler should work with the utf-16 and utf-32 
codecs too.


Note that this has been already reported in #3672, but eventually only the 
utf-8 codec was fixed.

[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

--
assignee: ezio.melotti
components: Unicode
messages: 143490
nosy: ezio.melotti, gvanrossum, haypo, lemburg, loewis, tchrist
priority: high
severity: normal
stage: test needed
status: open
title: UTF-16 and UTF-32 codecs should reject (lone) surrogates
type: behavior
versions: Python 3.3




[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

2011-09-04 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
 
> New submission from Ezio Melotti ezio.melo...@gmail.com:
>
> From Chapter 03 of the Unicode Standard 6[0], D91:
>
> • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode 
> scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single 
> unsigned 16-bit code unit with the same numeric value as the Unicode scalar 
> value, and that assigns each Unicode scalar value in the range 
> U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
> • Because surrogate code points are not Unicode scalar values, isolated 
> UTF-16 code units in the range 0xD800..0xDFFF are ill-formed.
>
> I.e. UTF-16 should be able to decode correctly a valid surrogate pair, and 
> encode a non-BMP character using a valid surrogate pair, but it should 
> reject lone surrogates both during encoding and decoding.
>
> On Python 3, the utf-16 codec can encode all the codepoints from U+0000 to 
> U+10FFFF (including (lone) surrogates), but it's not able to decode lone 
> surrogates (not sure if this is by design or if it just fails because it 
> expects another (missing) surrogate).
>
> --
>
> From Chapter 03 of the Unicode Standard 6[0], D90:
>
> • UTF-32 encoding form: The Unicode encoding form that assigns each Unicode 
> scalar value to a single unsigned 32-bit code unit with the same numeric 
> value as the Unicode scalar value.
> • Because surrogate code points are not included in the set of Unicode scalar 
> values, UTF-32 code units in the range 0xD800..0xDFFF are ill-formed.
>
> I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs, 
> both during encoding and during decoding.
>
> On Python 3, the utf-32 codec can encode and decode all the codepoints from 
> U+0000 to U+10FFFF (including surrogates).
>
> --
>
> I think that:
>   * this should be fixed in 3.3;
>   * it's a bug, so the fix /might/ be backported to 3.2.  However, it's also 
> a fairly big change in behavior, so it might be better to leave it for 3.3 
> only;
>   * it's better to leave 2.7 alone, since even the utf-8 codec is broken 
> there;
>   * the surrogatepass error handler should work with the utf-16 and utf-32 
> codecs too.
>
>
> Note that this has been already reported in #3672, but eventually only the 
> utf-8 codec was fixed.

All UTF codecs should reject lone surrogates in strict error mode,
but let them pass using the surrogatepass error handler (the UTF-8
codec already does) and apply the usual error handling for ignore
and replace.
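
The four cases listed above can be seen side by side with the UTF-8 codec, which already behaves this way. This is just an illustrative sketch; the dictionary name is arbitrary.

```python
lone = '\udc80'
outcomes = {}

for errors in ('strict', 'surrogatepass', 'ignore', 'replace'):
    try:
        outcomes[errors] = lone.encode('utf-8', errors)
    except UnicodeEncodeError as exc:
        # Strict mode is expected to land here.
        outcomes[errors] = exc

for errors, outcome in outcomes.items():
    print(errors, repr(outcome))
```

The proposal is for the utf-16 and utf-32 codecs to follow the same pattern: an error in strict mode, the raw surrogate with surrogatepass, and the usual handling for ignore and replace.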

--
