[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted

2014-03-12 Thread Julian Mehnle

Changes by Julian Mehnle jul...@mehnle.net:


--
nosy: +jmehnle

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9133
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted

2010-06-30 Thread Mike Lewis

New submission from Mike Lewis mikelikes...@gmail.com:

When I do
codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8')

it's not throwing an exception. '\xed\xbc\xad' is an invalid UTF-8 byte sequence.

It maps to the value U+DF2D, which is a surrogate code point (a lone low surrogate, not a pair), it seems.
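
For readers who want to check the arithmetic, here is a minimal Python 3 sketch of how those three bytes map to U+DF2D. The helper name decode_three_byte is made up for illustration, and it deliberately performs no validity checks; it only strips the UTF-8 marker bits and joins the payload bits.

```python
def decode_three_byte(seq: bytes) -> int:
    """Join the payload bits of a 3-byte UTF-8-style sequence (no validation)."""
    b0, b1, b2 = seq
    # Lead byte 1110xxxx carries 4 payload bits;
    # each continuation byte 10xxxxxx carries 6 payload bits.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte(b"\xed\xbc\xad")
print(f"U+{code_point:04X}")
# Low (trailing) surrogates occupy U+DC00 through U+DFFF.
print(0xDC00 <= code_point <= 0xDFFF)
```

The result falls in the low-surrogate range, which is why a strict UTF-8 decoder is expected to reject the sequence.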

http://tools.ietf.org/html/rfc3629#section-4

explains:

  However, pairs of
  UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
  parlance), being actually UCS-4 characters transformed through
  UTF-16, need special treatment: the UTF-16 transformation must be
  undone, yielding a UCS-4 character that is then transformed as
  above.

which would suggest that it is invalid.

However, I think wikipedia's explanation is a bit clearer:

UTF-8 may only legally be used to encode valid Unicode scalar values. According 
to the Unicode standard the high and low surrogate halves used by UTF-16 
(U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, 
and the UTF-8 encoding of them is an invalid byte sequence and should be 
treated as described above.
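
The rule quoted above can be written as a one-line check. This is only a sketch in Python 3; the function name is_unicode_scalar_value is an illustrative assumption, not something from the codecs module.

```python
def is_unicode_scalar_value(cp: int) -> bool:
    """True if cp is a code point that UTF-8 may legally encode."""
    # Scalar values are U+0000..U+10FFFF minus the surrogate block U+D800..U+DFFF.
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


print(is_unicode_scalar_value(0xDF2D))    # the code point from this report
print(is_unicode_scalar_value(0x20AC))    # EURO SIGN, an ordinary BMP character
print(is_unicode_scalar_value(0x110000))  # beyond the Unicode range
```

Under this rule the code point from the report is rejected, while ordinary characters pass.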


Thanks,
Mike

--
components: Unicode
messages: 109010
nosy: Mike.Lewis
priority: normal
severity: normal
status: open
title: Invalid UTF8 Byte sequence not raising exception/being substituted
versions: Python 2.6




[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted

2010-06-30 Thread Mike Lewis

Mike Lewis mikelikes...@gmail.com added the comment:

Sorry, meant to add this part to the quote from the rfc:

   This leads to different results for character
   numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
   valid UTF-8
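
To make the difference concrete, here is a rough Python 3 sketch comparing the UTF-8 and CESU-8 encodings of one supplementary character. Python ships no CESU-8 codec that I know of, so cesu8_encode_supplementary is a hypothetical helper that handles only a single character above U+FFFF.

```python
import struct


def cesu8_encode_supplementary(ch: str) -> bytes:
    """CESU-8 style encoding of one character above U+FFFF (illustrative only)."""
    # Split the character into its two UTF-16 surrogate code units.
    high, low = struct.unpack(">2H", ch.encode("utf-16-be"))
    # Encode each surrogate as its own 3-byte sequence; 'surrogatepass'
    # is needed because Python 3 refuses to encode surrogates by default.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (high, low))


ch = "\U00010000"
utf8 = ch.encode("utf-8")
cesu8 = cesu8_encode_supplementary(ch)
print(utf8.hex(), cesu8.hex())

try:
    cesu8.decode("utf-8")
    strict_accepts_cesu8 = True
except UnicodeDecodeError:
    strict_accepts_cesu8 = False
print(strict_accepts_cesu8)
```

The UTF-8 form is four bytes and the CESU-8 form is six, and a strict UTF-8 decoder does not accept the latter.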

--




[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted

2010-06-30 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

This is already fixed in Python 3.
However I think that for backward compatibility reasons it can't be fixed in 
Python 2, where it is possible to encode and decode every codepoint to/from 
UTF-8.
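
As a quick illustration of the Python 3 behaviour (a sketch, run on Python 3 only; the variable names are mine), the strict codec rejects the bytes from the report, while the 'surrogatepass' error handler opts back in to the permissive Python 2 style round trip:

```python
data = b"\xed\xbc\xad"

# Strict decoding (the default) rejects the encoded surrogate.
try:
    data.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True

# 'surrogatepass' lets the lone surrogate through, in both directions.
lone = data.decode("utf-8", "surrogatepass")
round_trip = lone.encode("utf-8", "surrogatepass")

# 'replace' substitutes U+FFFD instead of raising.
replaced = data.decode("utf-8", "replace")

# Strict encoding of a lone surrogate is refused as well.
try:
    lone.encode("utf-8")
    encode_rejected = False
except UnicodeEncodeError:
    encode_rejected = True

print(strict_rejected, encode_rejected, ascii(lone), ascii(replaced))
```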

See also http://bugs.python.org/issue8271#msg102209

I think this can be closed as wontfix.

--
nosy: +ezio.melotti, haypo, lemburg
status: open -> pending
type:  -> behavior
versions: +Python 2.7




[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted

2010-06-30 Thread Marc-Andre Lemburg

Changes by Marc-Andre Lemburg m...@egenix.com:


--
resolution:  -> wont fix
status: pending -> closed




[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted

2010-06-30 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
>
> I think this can be closed as wontfix.

Agreed. I've already closed the ticket.

--




[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted

2010-06-30 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
stage:  -> committed/rejected
