On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote:
On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
In any case, \ud800\udc01 isn't a valid unicode string.
I don't think this is correct. Can you show me where the standard says
that
On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 All three
encoding forms can be used to represent the full range of encoded
characters in the Unicode Standard; ... Each of the three Unicode
encoding
On 10/9/13 4:22 AM, wxjmfa...@gmail.com wrote:
On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 All three
encoding forms can be used to represent the full range of encoded
characters in the Unicode Standard; ...
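As a rough illustration of the quoted passage, the sketch below encodes one supplementary-plane character in each of the three encoding forms and decodes it back. The character is the one used elsewhere in this thread; the big-endian codec names are chosen here only to keep byte-order marks out of the output.

```python
# One character outside the Basic Multilingual Plane (U+10001).
c = '\N{LINEAR B SYLLABLE B038 E}'

# Encode it in each of the three Unicode encoding forms (big-endian, no BOM).
encoded = {name: c.encode(name) for name in ('utf-8', 'utf-16-be', 'utf-32-be')}

for name, data in encoded.items():
    # Each form round-trips back to the same single code point.
    print(name, data, data.decode(name) == c)
```

Only the UTF-16 form uses a surrogate pair (the code units D800 and DC01); UTF-8 and UTF-32 encode the code point directly.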
On 2013-10-09, Ned Batchelder n...@nedbatchelder.com wrote:
On 10/9/13 4:22 AM, wxjmfa...@gmail.com wrote:
and what Unicode.org does not say is that these coding schemes
(like any coding scheme) should be used in an exclusive way.
Can you clarify what you mean by in an exclusive way?
Ned,
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
py> surr = '\udc80'
py> surr.encode('utf-8')
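For anyone who wants to reproduce this, here is a minimal sketch of the same experiment as a script. It assumes CPython 3.3 or later; the `surrogatepass` error handler is a standard codec option and is shown only for contrast, not as something the original post used.

```python
surr = '\udc80'  # a lone low surrogate, as in the session above

# The default (strict) error handler refuses to encode a lone surrogate.
strict_failed = False
try:
    surr.encode('utf-8')
except UnicodeEncodeError as exc:
    strict_failed = True
    print('strict encode failed:', exc.reason)

# 'surrogatepass' emits the generic 3-byte form anyway. The result is not
# well-formed UTF-8, so only a decoder using the same handler accepts it.
passed = surr.encode('utf-8', 'surrogatepass')
print(passed)
print(passed.decode('utf-8', 'surrogatepass') == surr)
```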
On 2013-10-08, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'
and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't
On 2013-10-08, Neil Cerutti ne...@norwich.edu wrote:
In any case, \ud800\udc01 isn't a valid unicode string. In a
perfect world it would automatically get converted to
'\U00010001' without intervention.
This last paragraph is erroneous. I must have had a typo in my
testing.
--
Neil Cerutti
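Since Python does not perform that conversion automatically, here is a hedged sketch of doing it by hand with the standard UTF-16 surrogate arithmetic. The helper name `combine_surrogates` is hypothetical and not part of any library.

```python
def combine_surrogates(hi, lo):
    """Return the single code point that a high/low surrogate pair stands for."""
    h, l = ord(hi), ord(lo)
    if not (0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF):
        raise ValueError('not a high surrogate followed by a low surrogate')
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return chr(0x10000 + ((h - 0xD800) << 10) + (l - 0xDC00))

s = '\ud800\udc01'                 # two separate surrogate code points
combined = combine_surrogates(*s)  # one supplementary-plane character
print(len(s), len(combined), hex(ord(combined)))
print(combined.encode('utf-8'))    # this one encodes without error
```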
On 08/10/2013 16:23, Pete Forman wrote:
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position
0:
On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
On 10/8/2013 5:47 PM, Terry Reedy wrote:
On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
But reading the previous entry in the FAQs:
http://www.unicode.org/faq/utf_bom.html#utf8-4
I interpret this as meaning that I should be able to encode valid pairs
of surrogates.
It says you should be
On Tue, 08 Oct 2013 18:00:58 +0100, MRAB wrote:
The only time you should get a surrogate pair in a Unicode string is in
a narrow build, which doesn't exist in Python 3.3 and later.
Incorrect.
py> sys.version
'3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat
4.1.2-52)]'
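A small sketch of the point being made here, assuming CPython 3.3 or later: a str can still hold two surrogate code points side by side, and they are stored as two separate code points rather than as one character.

```python
s = '\ud800\udc01'  # written with escapes, so no decoding step is involved

# The string holds two code points, both in the surrogate range.
code_points = [ord(ch) for ch in s]
print(len(s), [hex(cp) for cp in code_points])

# It is not the same string as the single character U+10001.
print(s == '\U00010001')
```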
On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
In any case, \ud800\udc01 isn't a valid unicode string.
I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a
critical point, and the FAQ conflates
On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
In any case, \ud800\udc01 isn't a valid unicode string.
I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that