[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2014-03-31 Thread Julian Mehnle
Changes by Julian Mehnle: -- nosy: +jmehnle

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Roundup Robot
Roundup Robot added the comment: New changeset 96f4cee8ea5e by Victor Stinner in branch '3.3': Issue #8271: Fix compilation on Windows http://hg.python.org/cpython/rev/96f4cee8ea5e New changeset 6f44f33460cd by Victor Stinner in branch 'default': (Merge 3.3) Issue #8271: Fix compilation on Windo

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Agree. In 2.7 the UTF-8 codec is still broken in corner cases (it accepts surrogates), and 3.2 is coming to the end of its maintenance period. In any case it is only a recommendation, not a requirement.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Ezio Melotti
Ezio Melotti added the comment: Fixed, thanks for updating the patch! I committed it on 3.3 too, and while this could have gone on 2.7/3.2 too IMHO, it's too much work to port it there and not worth it. -- status: open -> closed versions: +Python 3.3

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Roundup Robot
Roundup Robot added the comment: New changeset 5962f192a483 by Ezio Melotti in branch '3.3': #8271: the utf-8 decoder now outputs the correct number of U+FFFD characters when used with the "replace" error handler on invalid utf-8 sequences. Patch by Serhiy Storchaka, tests by Ezio Melotti. ht

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: What about committing? All of Ezio's tests passed, and the microbenchmark shows differences of less than 10%:
vanilla       patched
MB/s          MB/s
2076 (-3%)    2007     decode utf-8 'A'*1
414 (-0%)     413      decode utf-8 '\x80'*1
1283 (-1%)    1275     decode utf-

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: -- versions: +Python 3.4 -Python 2.7, Python 3.1, Python 3.2, Python 3.3

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: Removed file: http://bugs.python.org/file26116/issue8271-3.3-fast-2.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-11-04 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: Removed file: http://bugs.python.org/file25709/issue8271-3.3.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is an updated patch that resolves the merge conflict with 3214c9ebcf5e. -- Added file: http://bugs.python.org/file26118/issue8271-3.3-fast-3.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is an updated, slightly faster patch. It is merged with decode_utf8_range_check.patch from issue14923. The patch contains Ezio Melotti's tests unmodified, and they all pass. -- Added file: http://bugs.python.org/file26116/issue8271-3.3-fast

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: No, it is not fully fixed. Only one bug was fixed, but the current behavior still does not conform to the Unicode Standard *recommendations*. Not conforming to recommendations is not a bug; conforming is a feature.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I deleted the fast patch, since it is unsafe. Issue14923 should compensate for the small slowdown in a safer way. I think this change is not a bugfix (this is not a bug, the standard allows such behavior), but a new feature, so I doubt the need to fix 2.7 and 3.2. Any chance

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Antoine Pitrou
Antoine Pitrou added the comment: Why is this marked "fixed"? Is it fixed or not?

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-06-23 Thread Serhiy Storchaka
Changes by Serhiy Storchaka: Removed file: http://bugs.python.org/file25720/issue8271-3.3-fast.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-26 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Fortunately, issue14923 (if accepted) will compensate for the slowdown.
On 32-bit Linux, AMD Athlon 64 X2:
                vanilla      old patch    fast patch
utf-8 'A'*1     2016 (+3%)   2111 (-2%)   207

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-26 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here are the benchmark results (numbers are speed, MB/s).
On 32-bit Linux, AMD Athlon 64 X2:
                   vanilla      patched
utf-8 'A'*1        2016 (+5%)   2111
utf-8 '\x80'*1

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-25 Thread Ezio Melotti
Ezio Melotti added the comment: Do you have any benchmark results?

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-25 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Here is a patch for 3.3. All of the tests pass successfully. Unfortunately, it is a little slow, but I tried to minimize the losses. -- Added file: http://bugs.python.org/file25709/issue8271-3.3.patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
> I don't remember all the details right now, but if that test was passing with
> my patch there must be something wrong somewhere (either in the patch, in the
> test, or in our understanding of the standard).
No, the test correctly expects two U+FFFD. Current

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti added the comment:
> I probably put it poorly. Past and current implementations raise
> 'unexpected end of data' and not 'invalid continuation byte'. The test
> expects 'invalid continuation byte'.
I don't think it matters much either way.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti added the comment:
> \xe0\x80 is not a maximal subpart. Therefore, there must be two U+FFFD.
OK, now I get what you mean. The valid range for continuation bytes that can follow E0 is A0-BF, not 80-BF as usual, so \x80 is not a valid continuation byte here. While working on the pa

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
> Changing from 'unexpected end of data' to 'invalid continuation byte' for
> b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7,
> 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though).
I probably put it poorly

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
> This might be just because it first checks if there are two more bytes before
> checking if they are valid, but 'invalid continuation byte' works too.
Yes, this is an implementation detail. It is much easier and faster. Is it necessary to change it?
> Why n

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
> I think that one U+FFFD is correct. The only error is a premature end of
> data.
I expressed myself poorly. I also think that there is only one decoding error, and not two. I think the test is wrong.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti added the comment: Changing from 'unexpected end of data' to 'invalid continuation byte' for b'\xe0\x00' is fine with me, but this will be a (minor) deviation from 2.7, 3.1, 3.2, and pypy (it could still be changed on all these except 3.1 though). If you make any changes on the t
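The difference between the two error messages discussed here can be inspected directly on a `UnicodeDecodeError`. The following is only a small sketch: the helper name `describe_decode_error` is made up for illustration, and the exact `reason` strings depend on the Python version in use, so the loop prints them rather than assuming them.

```python
def describe_decode_error(data: bytes):
    """Return (reason, start, end) for a strict UTF-8 decoding failure, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # reason is the human-readable message; start/end delimit the bad bytes
        return exc.reason, exc.start, exc.end
    return None


# Truncated input versus a lead byte followed by a byte that cannot continue it.
for sample in (b"\xe0", b"\xe0\x00", b"\xe0\x80", b"\xe0\xa0"):
    print(sample, describe_decode_error(sample))
```

The `start` and `end` attributes are what error handlers such as 'replace' use to decide how many bytes one replacement covers, which is why the wording of the message and the number of U+FFFD are related questions.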

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti added the comment:
> Tests fail, but I'm not sure that the tests are correct.
> b'\xe0\x00' raises 'unexpected end of data' and not 'invalid
> continuation byte'. This is a terminological issue.
This might be just because it first checks if there are two more bytes before checking if

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Saul Spatz
Saul Spatz added the comment:
> b'\xe0\x80'.decode('utf-8', 'replace') returns one U+FFFD and not two. I
> don't think that is right.
I think that one U+FFFD is correct. The only error is a premature end of data.
On Thu, May 17, 2012 at 12:31 PM, Serhiy Storchaka wrote:
>
> Serhiy Storchaka a

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
> The only issue left was about the number of U+FFFD generated with invalid
> sequences in some cases.
> My last patch has extensive tests for this, so you could try to apply it (or
> copy the tests) and see if they all pass.
Tests fail, but I'm not sure t

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Ezio Melotti
Ezio Melotti added the comment: The original bug should be fixed already in 3.3 and there should be tests (unless they got removed/skipped after we changed unicode implementation). The only issue left was about the number of U+FFFD generated with invalid sequences in some cases. My last patch

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2012-05-17 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Looks like issue14738 fixes this bug for Python 3.3.
>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdAB'
>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdABCD'
-- nosy: +storchaka

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-09-21 Thread Stefan Ring
Changes by Stefan Ring: -- nosy: +Ringding

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-08-15 Thread Ezio Melotti
Ezio Melotti added the comment: Here are some benchmarks:
Commands:
# half of the bytes are invalid
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "surrogateescape")'
./python -m timeit -s 'b = bytes(range(256)); b_dec = b.decode' 'b_dec("utf-8", "replace")'
./
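The same measurement can be driven from Python with the `timeit` module instead of the command line. This is a rough sketch rather than the benchmark actually used in the thread: the helper `time_decode` and the iteration count are assumptions, and absolute numbers will vary by machine and build.

```python
import timeit

# Same input as the commands above: every byte value once, so the upper half is invalid.
data = bytes(range(256))


def time_decode(errors: str, number: int = 10000) -> float:
    """Seconds spent on `number` calls of data.decode('utf-8', errors)."""
    return timeit.timeit(lambda: data.decode("utf-8", errors), number=number)


for handler in ("surrogateescape", "replace"):
    print(handler, time_decode(handler))
```

Comparing the two handlers on identical input isolates the cost of the error path, which is the part of the decoder the patch touches.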

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-07-07 Thread Saul Spatz
Changes by Saul Spatz: -- nosy: +spatz123

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-04-19 Thread Ezio Melotti
Ezio Melotti added the comment: Attached patch against 3.1 fixes the number of FFFD. A test for the range in the error message should probably be added. I haven't done any benchmark yet. There's some code duplication, but I'm not sure it can be factored out. -- versions: +Python 3.3

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-02-28 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Ezio Melotti wrote:
>
> Ezio Melotti added the comment:
>
> The patch turned out to be less trivial than I initially thought.
>
> The current algorithm checks for invalid continuation bytes in 4 places:
> 1) before the switch/case statement in Objects/un

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-02-27 Thread Ezio Melotti
Ezio Melotti added the comment: The patch turned out to be less trivial than I initially thought. The current algorithm checks for invalid continuation bytes in 4 places: 1) before the switch/case statement in Objects/unicodeobject.c when it checks if there are enough bytes in the string (e.g.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2011-02-25 Thread Ezio Melotti
Ezio Melotti added the comment: After a mail I sent to the Unicode Consortium about the corner case I found, they updated the "Best Practices for Using U+FFFD"[0] and now it says: """ Another example illustrates the application of the concept of maximal subpart for UTF-8 continuation bytes ou

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-12-29 Thread Alexander Belopolsky
Changes by Alexander Belopolsky: -- nosy: +belopolsky

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-03 Thread John Machin
John Machin added the comment: About the E0 80 81 61 problem: my interpretation is that you are correct, the 80 is not valid in the current state (start byte == E0), so no look-ahead, three FFFDs must be issued followed by 0061. I don't really care about issuing too many FFFDs so long as it d
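The E0 80 81 61 case can be checked against whatever interpreter is at hand. The expected value in the comment assumes a Python version that includes the final fix from this issue (3.3 or later, according to the commit messages in this thread); older versions may print something different.

```python
data = bytes([0xE0, 0x80, 0x81, 0x61])

# 80 cannot follow E0, so E0 is replaced on its own; 80 and 81 are then
# stray continuation bytes, each replaced separately; 61 decodes as 'a'.
decoded = data.decode("utf-8", "replace")
print(ascii(decoded))  # '\ufffd\ufffd\ufffda' on Python 3.3+
```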

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-02 Thread Ezio Melotti
Ezio Melotti added the comment: Backported to 2.6 and 3.1 in r82470 and r82469. I'll leave this open for a while to see if anyone has any comment on my previous message. -- resolution: -> fixed stage: patch review -> committed/rejected

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-02 Thread Ezio Melotti
Ezio Melotti added the comment: I've found a subtle corner case about 3- and 4-byte-long sequences. For example, according to http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95, table 3.7) the sequences in range \xe0\x80\x80-\xe0\x9f\xbf are invalid. I.e. if the first byte is
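The restriction on the byte that follows a given lead byte can be written down as a small lookup. The sketch below encodes the second-byte ranges as I understand them from Table 3-7 of the Unicode Standard (only the E0 row is quoted in this thread, so the other rows are stated from memory of the standard), and cross-checks them against the interpreter's strict decoder. The function names are illustrative.

```python
def second_byte_range(lead: int):
    """Inclusive range allowed for the byte after `lead`, or None if `lead`
    cannot start a multi-byte sequence (assumed per Unicode Table 3-7)."""
    if 0xC2 <= lead <= 0xDF:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)  # excludes overlong 3-byte forms
    if lead == 0xED:
        return (0x80, 0x9F)  # excludes surrogates U+D800..U+DFFF
    if 0xE1 <= lead <= 0xEF:
        return (0x80, 0xBF)
    if lead == 0xF0:
        return (0x90, 0xBF)  # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)  # excludes code points above U+10FFFF
    return None


def decoder_accepts(lead: int, second: int) -> bool:
    """True if lead+second, padded with 0x80 continuation bytes, decodes strictly."""
    length = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    sequence = bytes([lead, second]) + b"\x80" * (length - 2)
    try:
        sequence.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


mismatches = [
    (lead, second)
    for lead in range(0xC2, 0xF5)
    for second in range(256)
    if decoder_accepts(lead, second)
    != (second_byte_range(lead)[0] <= second <= second_byte_range(lead)[1])
]
print(mismatches)
```

An empty list means the decoder and the table agree on every lead/second-byte pair; only the second byte needs a lead-specific range, since the remaining bytes always use 80-BF.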

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-07-01 Thread Ezio Melotti
Ezio Melotti added the comment: Ported to py3k in r82413. Some tests with non-BMP characters should probably be added. The patch should still be ported to 2.6 and 3.1.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-06-30 Thread Ezio Melotti
Ezio Melotti added the comment: The issue about invalid surrogates in UTF-8 has been raised in #9133.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-06-05 Thread Ezio Melotti
Ezio Melotti added the comment: Fixed on trunk in r81758 and r81759. I'm leaving the issue open until I port it to the other versions.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-06-04 Thread Ezio Melotti
Ezio Melotti added the comment: I added a test for the 'ignore' error handler. I will commit the patch before the RC unless someone has something against it. To summarize, the patch updates PyUnicode_DecodeUTF8 from RFC 2279 to RFC 3629, so: 1) Invalid sequences are now handled as described i
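A few byte strings illustrate what moving from RFC 2279 to RFC 3629 means in practice. This is a sketch only: the sample labels and the helper `is_valid_utf8` are made up for illustration, and the behaviour shown is that of a Python 3 interpreter with this fix applied.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes under the strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The highest code point, U+10FFFF, needs exactly four bytes.
highest = chr(0x10FFFF).encode("utf-8")
print(highest.hex())  # f48fbfbf

samples = {
    "one past U+10FFFF": b"\xf4\x90\x80\x80",
    "5-byte form allowed by RFC 2279": b"\xf8\x88\x80\x80\x80",
    "overlong 2-byte encoding of '/'": b"\xc0\xaf",
}
for label, raw in samples.items():
    print(label, is_valid_utf8(raw))
```

All three samples are rejected, while the four-byte encoding of U+10FFFF round-trips.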

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-07 Thread Ezio Melotti
Changes by Ezio Melotti: -- nosy: +pitrou

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-07 Thread STINNER Victor
STINNER Victor added the comment:
> >> I also found out that, according to RFC 3629, surrogates
> >> are considered invalid and they can't be encoded/decoded,
> >> but the UTF-8 codec actually does it.
> >
> > Python2 does, but Python3 raises an error.
> > (...)
>
> I wonder how that change got

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-07 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: STINNER Victor wrote:
>
> STINNER Victor added the comment:
>
>> I also found out that, according to RFC 3629, surrogates
>> are considered invalid and they can't be encoded/decoded,
>> but the UTF-8 codec actually does it.
>
> Python2 does, but Python

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-06 Thread Ezio Melotti
Ezio Melotti added the comment: The patch was causing a failure in test_codeccallbacks, issue8271v4 fixes the test. (The failing test in test_codeccallbacks was testing that registering error handlers works, using a function that replaced "\xc0\x80" with "\x00". Since now "\xc0" is an invalid
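A handler of the kind that test exercises might look like the sketch below. It assumes, as the message says, that the decoder now reports only `\xc0` as the failing byte, so the handler has to look one byte ahead itself. The handler name `example.relaxed_utf8` is invented here and is not the name used in the test suite.

```python
import codecs


def relaxed_utf8(exc):
    """Decode the overlong pair C0 80 as U+0000; re-raise any other error."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("don't know how to handle %r" % exc)
    # exc.start points at the invalid \xc0; peek at it plus the next byte.
    if exc.object[exc.start:exc.start + 2] == b"\xc0\x80":
        return ("\x00", exc.start + 2)  # resume decoding after both bytes
    raise exc


codecs.register_error("example.relaxed_utf8", relaxed_utf8)

decoded = b"a\xc0\x80b".decode("utf-8", "example.relaxed_utf8")
print(ascii(decoded))  # 'a\x00b'
```

Returning `exc.start + 2` rather than `exc.end` is the key point: the handler, not the decoder, decides that the second byte belongs to the same unit.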

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-03 Thread Ezio Melotti
Ezio Melotti added the comment: This new patch (v3) should be ok. I added a few more tests and found another corner case: '\xe1a'.decode('utf-8', 'replace') was returning u'\ufffd' because \xe1 is the start byte of a 3-byte sequence and there were only two bytes in the string. This is now fix
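The corner case is easy to reproduce with a two-byte input. The snippet uses Python 3 `bytes` syntax rather than the Python 2 string literal quoted above, and the values in the comments assume an interpreter that includes the fix.

```python
# \xe1 starts a 3-byte sequence, but 'a' is not a continuation byte, so only
# \xe1 should be replaced and 'a' must survive.
with_replace = b"\xe1a".decode("utf-8", "replace")
with_ignore = b"\xe1a".decode("utf-8", "ignore")

print(ascii(with_replace))  # '\ufffda'
print(ascii(with_ignore))   # 'a'
```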

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-03 Thread STINNER Victor
STINNER Victor added the comment:
> I also found out that, according to RFC 3629, surrogates
> are considered invalid and they can't be encoded/decoded,
> but the UTF-8 codec actually does it.
Python2 does, but Python3 raises an error.
Python 2.7a4+ (trunk:79675, Apr 3 2010, 16:11:36)
>>> u
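The Python 3 behaviour described here can be demonstrated as follows. The 'surrogatepass' error handler used at the end is not mentioned in this message; it is included only to show the documented opt-in for the old behaviour.

```python
lone_surrogate = "\ud800"

try:
    lone_surrogate.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

try:
    b"\xed\xa0\x80".decode("utf-8")  # what an encoded U+D800 would look like
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(encode_failed, decode_failed)  # True True on Python 3

# Explicit opt-in: 'surrogatepass' lets surrogates through in both directions.
raw = lone_surrogate.encode("utf-8", "surrogatepass")
print(raw)  # b'\xed\xa0\x80'
```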

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-03 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Ezio Melotti wrote:
>
> Ezio Melotti added the comment:
>
> Here's a new patch. Should be complete but I want to test it some more before
> committing.
> I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range
> F5-FD (we can always pu

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-02 Thread Ezio Melotti
Ezio Melotti added the comment: Here's a new patch. Should be complete but I want to test it some more before committing. I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range F5-FD (we can always put them back in the unlikely case that the Unicode Consortium changed its m

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin added the comment: @lemburg: """perhaps applying the same logic as for the other sequences is a better strategy""" What other sequences??? F5-FF are invalid bytes; they don't start valid sequences. What same logic?? At the start of a character, they should get the same short shar

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin added the comment: Chapter 3, page 94: """As a consequence of the well-formedness conditions specified in Table 3-7, the following byte values are disallowed in UTF-8: C0–C1, F5–FF""" Of course they should be handled by the simple expedient of setting their length entry to zero.
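The list of disallowed byte values can be derived empirically by encoding every Unicode scalar value and seeing which byte values never occur. This is a brute-force sketch (it builds a string of about 1.1 million characters), not how the decoder determines validity.

```python
# Every code point except the surrogate range U+D800..U+DFFF.
all_scalars = "".join(
    chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
)

# Byte values that appear somewhere in well-formed UTF-8.
used = set(all_scalars.encode("utf-8"))
unused = sorted(set(range(256)) - used)

print([hex(value) for value in unused])
```

The result is C0, C1 and F5 through FF, matching the quoted sentence from the standard.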

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Ezio Melotti wrote:
>
> Ezio Melotti added the comment:
>
> Even if they are not valid they still "eat" all the 4/5/6 bytes, so they
> should be fixed too. I haven't seen anything about these bytes in chapter 3 so
> far, but there are at least two possib

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Ezio Melotti
Ezio Melotti added the comment: Even if they are not valid they still "eat" all the 4/5/6 bytes, so they should be fixed too. I haven't seen anything about these bytes in chapter 3 so far, but there are at least two possibilities: 1) consider all the bytes in range F5-FD as invalid without look

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin added the comment: Patch review: Preamble: pardon my ignorance of how the codebase works, but trunk unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k unicodeobject.c is r79506 (and bans the surrogate caper) and I can't find the r79542 that the patch me

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: John Machin wrote:
>
> John Machin added the comment:
>
> @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago.
I know.
> The standard now says 21 bits is it.
It says that the current Unicode codespace only uses 21 bits. In the early days 16

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin added the comment: @lemburg: RFC 2279 was obsoleted by RFC 3629 over 6 years ago. The standard now says 21 bits is it. F5-FF are declared to be invalid. I don't understand what you mean by "supporting those possibilities". The code is correctly issuing an error message. The goal o

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: John Machin wrote:
>
> John Machin added the comment:
>
> Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a
> valid 5-byte or 6-byte UTF-8 string.
The UTF-8 codec was written at a time when UTF-8 still included the possibility

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin added the comment: Unicode has been frozen at 0x10FFFF. That's it. There is no such thing as a valid 5-byte or 6-byte UTF-8 string.

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: Ezio Melotti wrote:
>
> Ezio Melotti added the comment:
>
> Here is an incomplete patch. It seems to solve the problem but I still have
> to add more tests and check it better.
Thanks. Please also check whether it's worthwhile unrolling those loops by h

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Ezio Melotti
Ezio Melotti added the comment: Here is an incomplete patch. It seems to solve the problem but I still have to add more tests and check it better. I also wonder if the sequences with the first byte in range F5-FD (start of 4/5/6-byte sequences, restricted by RFC 3629) should behave in the same

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: John Machin wrote:
>
> John Machin added the comment:
>
> @lemburg: "failing byte" seems rather obvious: first byte that you meet that
> is not valid in the current state. I don't understand your explanation,
> especially "does not have the high bit set

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread Ezio Melotti
Ezio Melotti added the comment: That's why I'm writing tests that cover all the cases, including overlong sequences. If the tests fail I'll change the patch :)

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-04-01 Thread John Machin
John Machin added the comment: @ezio.melotti: """I'm considering valid all the bytes that start with '10...'""" Sorry, WRONG. Read what I wrote: """Further, some bytes in the range 80-BF are NOT always valid as the first continuation byte, it depends on what starter byte they follow.""" Cons

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Ezio Melotti
Ezio Melotti added the comment: Yes, right now I'm considering valid all the bytes that start with '10...'. C2 starts with '11...' so it's a "failing byte".

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread John Machin
John Machin added the comment: @ezio.melotti: Your second sentence is true, but it is not the whole truth. Bytes in the range C0-FF (whose high bit *is* set) ALSO shouldn't be considered part of the sequence because they (like 00-7F) are invalid as continuation bytes; they are either starter

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Ezio Melotti
Changes by Ezio Melotti: -- assignee: -> ezio.melotti

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Ezio Melotti
Ezio Melotti added the comment: Having the 'high bit set' means that the first bit is set to 1. All the continuation bytes (i.e. the 2nd, 3rd or 4th byte in a sequence) have the first two bits set to 1 and 0 respectively, so if the first bit is not set to 1 then the byte shouldn't be considere
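The bit patterns described here translate into two mask tests. This sketch classifies a single byte by its top bits only; the function name and category labels are made up, and, as other messages in this thread point out, the bit pattern alone does not make a continuation byte valid after every lead byte.

```python
def classify(byte: int) -> str:
    """Classify one byte by its high bits (illustrative labels)."""
    if byte & 0x80 == 0x00:
        return "ascii"            # 0xxxxxxx: high bit not set
    if byte & 0xC0 == 0x80:
        return "continuation"     # 10xxxxxx
    return "lead-or-invalid"      # 11xxxxxx: a starter byte or a disallowed value


for value in (0x41, 0x80, 0xBF, 0xC2, 0xFF):
    print(hex(value), classify(value))
```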

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread John Machin
John Machin added the comment: @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below. Ex

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Marc-Andre Lemburg
Marc-Andre Lemburg added the comment: I guess the term "failing" byte is somewhat underdefined. Page 95 of the standard PDF (http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf) suggests to "Replace each maximal subpart of an ill-formed subsequence by a single U+FFFD". Fortunately, they expla
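The "maximal subpart" recommendation is easiest to see on a mixed input. The byte sequence below is, to the best of my recollection, the one used for illustration in chapter 3 of the Unicode Standard; the expected result assumes a Python version with the final fix from this issue (3.3 or later).

```python
data = bytes.fromhex("61 F1 80 80 E1 80 C2 62 80 63 80 BF 64")

decoded = data.decode("utf-8", "replace")
print(ascii(decoded))

# How the replacements line up with the input:
#   61         -> 'a'
#   F1 80 80   -> one U+FFFD (truncated 4-byte sequence, a maximal subpart)
#   E1 80      -> one U+FFFD (truncated 3-byte sequence)
#   C2         -> one U+FFFD (lead byte followed by a non-continuation byte)
#   62         -> 'b'
#   80         -> one U+FFFD (stray continuation byte)
#   63         -> 'c'
#   80, BF     -> two U+FFFD (two stray continuation bytes)
#   64         -> 'd'
```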

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread R. David Murray
Changes by R. David Murray: -- nosy: +lemburg

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Daniel Graña
Daniel Graña added the comment: Some background for this report at http://stackoverflow.com/questions/2547262/why-is-python-decode-replacing-more-than-the-invalid-bytes-from-an-encoded-string/2548480

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-31 Thread Daniel Graña
Changes by Daniel Graña: -- nosy: +dangra

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-30 Thread Ezio Melotti
Changes by Ezio Melotti: -- components: +Unicode nosy: +ezio.melotti priority: -> normal stage: -> test needed versions: +Python 3.2

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

2010-03-30 Thread John Machin
New submission from John Machin: Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on Conversion Processes") after requirement D93. Recent Pythons, e.g. 3.1.2, don't comply. Using the Unicode example:
>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufff