Treating a unicode string as latin-1

2008-01-03 Thread Simon Willison
Hello, I'm using ElementTree to parse an XML file which includes some data encoded as cp1252, for example:

    <name>Bob\x92s Breakfast</name>

If this was a regular bytestring, I would convert it to utf8 using the following:

    >>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
    Bob’s Breakfast

Re: Treating a unicode string as latin-1

2008-01-03 Thread Duncan Booth
Simon Willison [EMAIL PROTECTED] wrote:
> How can I tell Python "I know this says it's a unicode string, but I need you to treat it like a bytestring"?

Can you not just fix your xml file so that it uses the same encoding as it claims to use? If the xml says it contains utf8 encoded data then it

Re: Treating a unicode string as latin-1

2008-01-03 Thread Paul Hankin
On Jan 3, 1:31 pm, Simon Willison [EMAIL PROTECTED] wrote:
> How can I tell Python "I know this says it's a unicode string, but I need you to treat it like a bytestring"?

    u'Bob\x92s Breakfast'.encode('latin-1')

-- Paul Hankin
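The `.encode('latin-1')` suggestion works because latin-1 maps code points 0 to 255 straight onto the same byte values, so encoding recovers the original bytes, which can then be decoded as cp1252. A minimal sketch of that round trip follows. It is written in Python 3 syntax (an assumption; the thread itself uses Python 2), and the variable names are illustrative only.

```python
# What the parser handed back: the cp1252 byte 0x92 was read as code point U+0092.
mis_decoded = 'Bob\x92s Breakfast'

# latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
# so this recovers the bytes that were in the file.
raw_bytes = mis_decoded.encode('latin-1')

# Decode those bytes with the encoding they were really written in.
fixed = raw_bytes.decode('cp1252')

# Finally, produce the UTF-8 bytes the original poster wanted.
utf8_bytes = fixed.encode('utf8')
```

After this, `fixed` holds the text with a proper right single quotation mark (U+2019) in place of the stray U+0092.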

Re: Treating a unicode string as latin-1

2008-01-03 Thread Diez B. Roggisch
Simon Willison wrote:
> Hello, I'm using ElementTree to parse an XML file which includes some data encoded as cp1252, for example:
>
>     <name>Bob\x92s Breakfast</name>
>
> If this was a regular bytestring, I would convert it to utf8 using the following: print 'Bob\x92s

Re: Treating a unicode string as latin-1

2008-01-03 Thread Jeroen Ruigrok van der Werven
On [20080103 14:36], Simon Willison ([EMAIL PROTECTED]) wrote:
> How can I tell Python "I know this says it's a unicode string, but I need you to treat it like a bytestring"?

Although it does not address the exact question, it does raise the issue of how you are using ElementTree. When I use the

Re: Treating a unicode string as latin-1

2008-01-03 Thread Fredrik Lundh
Simon Willison wrote:
> But ElementTree gives me back a unicode string, so I get the following error:
>
>     >>> print u'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>       File /Library/Frameworks/Python.framework/Versions/2.5/lib/

Re: Treating a unicode string as latin-1

2008-01-03 Thread Duncan Booth
Fredrik Lundh [EMAIL PROTECTED] wrote:
> ET has already decoded the CP1252 data for you. If you want UTF-8, all you need to do is to encode it:
>
>     >>> u'Bob\x92s Breakfast'.encode('utf8')
>     'Bob\xc2\x92s Breakfast'

I think he is claiming that the encoding information in the file is incorrect and
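The `'Bob\xc2\x92s Breakfast'` result quoted above is valid UTF-8, but it encodes U+0092, a C1 control character, not an apostrophe. The short check below illustrates that point; it is a sketch in Python 3 syntax (an assumption, since the thread uses Python 2), with illustrative variable names.

```python
import unicodedata

s = 'Bob\x92s Breakfast'

# Encoding the string directly preserves U+0092 as the two bytes C2 92.
direct = s.encode('utf8')

# U+0092 is a control character ("Cc"), so it will not render as an apostrophe.
category = unicodedata.category('\x92')

# The character cp1252 byte 0x92 actually stands for is U+2019.
intended_name = unicodedata.name('\u2019')
```

This is why encoding alone only helps if the parser decoded the data correctly in the first place.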

Re: Treating a unicode string as latin-1

2008-01-03 Thread Diez B. Roggisch
Duncan Booth wrote:
> Fredrik Lundh [EMAIL PROTECTED] wrote:
>> ET has already decoded the CP1252 data for you. If you want UTF-8, all you need to do is to encode it:
>>
>>     >>> u'Bob\x92s Breakfast'.encode('utf8')
>>     'Bob\xc2\x92s Breakfast'
>
> I think he is claiming that the encoding information in the

Re: Treating a unicode string as latin-1

2008-01-03 Thread Fredrik Lundh
Diez B. Roggisch wrote: I would think it more likely that he wants to end up with u'Bob\u2019s Breakfast' rather than u'Bob\x92s Breakfast', although u'Dog\u2019s dinner' seems a probable consequence. If that's the case, he should read the file as a string, decode and re-encode it (probably into
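The "read the file as a string, then decode and re-encode" idea can be sketched as fixing the bytes before the parser ever sees them. The example below is an illustration only, in Python 3 syntax: the `raw` bytes are a made-up stand-in for the file's contents, and UTF-8 as the target encoding is an assumption, since the message above is cut off before naming one.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the file's raw bytes: cp1252 data with no XML declaration.
raw = b'<name>Bob\x92s Breakfast</name>'

# Decode with the encoding the data is really in, then re-encode as UTF-8.
utf8_xml = raw.decode('cp1252').encode('utf-8')

# Without an XML declaration the parser assumes UTF-8, which now matches the bytes.
root = ET.fromstring(utf8_xml)
name = root.text
```

If the real file carries an XML declaration naming a different encoding, that declaration would also need to be updated or removed so it agrees with the re-encoded bytes.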