Re: Encoding of surrogate code points to UTF-8

2013-10-09 Thread Steven D'Aprano
On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote: On 10/8/2013 6:30 PM, Steven D'Aprano wrote: On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote: In any case, \ud800\udc01 isn't a valid unicode string. I don't think this is correct. Can you show me where the standard says that

Re: Encoding of surrogate code points to UTF-8

2013-10-09 Thread wxjmfauth
Le mercredi 9 octobre 2013 08:20:05 UTC+2, Steven D'Aprano a écrit : http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; ... Each of the three Unicode encoding

Re: Encoding of surrogate code points to UTF-8

2013-10-09 Thread Ned Batchelder
On 10/9/13 4:22 AM, wxjmfa...@gmail.com wrote: Le mercredi 9 octobre 2013 08:20:05 UTC+2, Steven D'Aprano a écrit : http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; ...

Re: Encoding of surrogate code points to UTF-8

2013-10-09 Thread Neil Cerutti
On 2013-10-09, Ned Batchelder <n...@nedbatchelder.com> wrote: On 10/9/13 4:22 AM, wxjmfa...@gmail.com wrote: and what Unicode.org does not say is that these coding schemes (like any coding scheme) should be used in an exclusive way. Can you clarify what you mean by "in an exclusive way"? Ned,

Encoding of surrogate code points to UTF-8

2013-10-08 Thread Steven D'Aprano
I think this is a bug in Python's UTF-8 handling, but I'm not sure. If I've read the Unicode FAQs correctly, you cannot encode *lone* surrogate code points into UTF-8: http://www.unicode.org/faq/utf_bom.html#utf8-5 Sure enough, using Python 3.3:
py> surr = '\udc80'
py> surr.encode('utf-8')
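A minimal sketch of the behaviour this post describes, assuming CPython 3.3 or later. The helper name `try_encode` is hypothetical, and the `surrogatepass` error handler is shown only as one way to see the bytes the strict codec refuses to produce; it is not mentioned in the excerpt above.

```python
def try_encode(text, errors="strict"):
    """Return the UTF-8 bytes for text, or None if the codec rejects it."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None

lone = "\udc80"  # a lone low surrogate code point

# The strict UTF-8 codec refuses lone surrogates.
strict_result = try_encode(lone)

# 'surrogatepass' emits the three-byte form the strict codec rejects.
passed_result = try_encode(lone, "surrogatepass")

print(strict_result)
print(passed_result)
```

The second result is not well-formed UTF-8 in the sense of the Unicode FAQ linked above, which is why the strict codec raises an error instead of producing it.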

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Neil Cerutti
On 2013-10-08, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'
and then use those same values as the code points, I ought to be able to encode to UTF-8, as if it
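As a rough illustration of where the bytes `d8 00 dc 01` come from, the sketch below derives the UTF-16 surrogate pair arithmetically and compares it with what Python's codecs produce. The function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

c = "\N{LINEAR B SYLLABLE B038 E}"    # U+10001
high, low = to_surrogate_pair(ord(c))

utf16 = c.encode("utf-16be")  # the surrogate pair, as big-endian 16-bit units
utf8 = c.encode("utf-8")      # a single four-byte sequence, no surrogates

print(hex(high), hex(low))
print(utf16)
print(utf8)
```

UTF-8 encodes the code point U+10001 directly as four bytes. The surrogate pair exists only as UTF-16 code units, which is the distinction the thread is arguing about.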

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Pete Forman
Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes: I think this is a bug in Python's UTF-8 handling, but I'm not sure. [snip]
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Neil Cerutti
On 2013-10-08, Neil Cerutti <ne...@norwich.edu> wrote: In any case, \ud800\udc01 isn't a valid unicode string. In a perfect world it would automatically get converted to '\u00010001' without intervention. This last paragraph is erroneous. I must have had a typo in my testing. -- Neil Cerutti

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread MRAB
On 08/10/2013 16:23, Pete Forman wrote: Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes: I think this is a bug in Python's UTF-8 handling, but I'm not sure. [snip]
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread wxjmfauth
>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0:

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Terry Reedy
On 10/8/2013 9:52 AM, Steven D'Aprano wrote: I think this is a bug in Python's UTF-8 handling, but I'm not sure. If I've read the Unicode FAQs correctly, you cannot encode *lone* surrogate code points into UTF-8: http://www.unicode.org/faq/utf_bom.html#utf8-5 Sure enough, using Python 3.3:

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Terry Reedy
On 10/8/2013 5:47 PM, Terry Reedy wrote: On 10/8/2013 9:52 AM, Steven D'Aprano wrote: But reading the previous entry in the FAQs: http://www.unicode.org/faq/utf_bom.html#utf8-4 I interpret this as meaning that I should be able to encode valid pairs of surrogates. It says you should be

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Steven D'Aprano
On Tue, 08 Oct 2013 18:00:58 +0100, MRAB wrote: The only time you should get a surrogate pair in a Unicode string is in a narrow build, which doesn't exist in Python 3.3 and later. Incorrect.
py> sys.version
'3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)]'
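A small sketch of the point being made here: a Python 3.3+ string can hold surrogate code points as two separate elements, and nothing merges them automatically. The round trip through UTF-16 with `surrogatepass` is an assumption of this example rather than something from the thread, and that handler only works with the UTF-16 codecs from Python 3.4 onward.

```python
s = "\ud800\udc01"  # two surrogate code points, written explicitly

# The string holds two separate code points, not one character.
length = len(s)
code_points = [hex(ord(ch)) for ch in s]

# Reinterpret the two code points as UTF-16 code units, then decode them
# as a real surrogate pair (needs Python 3.4+ for 'surrogatepass' here).
combined = s.encode("utf-16-be", "surrogatepass").decode("utf-16-be")

print(length, code_points)
print(len(combined), hex(ord(combined)))
```

After the round trip the result is the single code point U+10001, which encodes to UTF-8 without error, unlike the original two-element string.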

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Steven D'Aprano
On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote: In any case, \ud800\udc01 isn't a valid unicode string. I don't think this is correct. Can you show me where the standard says that Unicode strings[1] may not contain surrogates? I think that is a critical point, and the FAQ conflates

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Terry Reedy
On 10/8/2013 6:30 PM, Steven D'Aprano wrote: On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote: In any case, \ud800\udc01 isn't a valid unicode string. I don't think this is correct. Can you show me where the standard says that Unicode strings[1] may not contain surrogates? I think that