Hi experts,
I'm new to Python. I'm writing crawlers to extract data from some
Chinese websites, and I've run into an encoding problem.
I have a unicode object that looks like this: u'\xd6\xd0\xce\xc4'.
The data is encoded in GB2312, but I have no idea how to convert it
back to UTF-8.
to re-create
2010/4/1 Mister Yu eryan...@gmail.com:
i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
which is encoded in gb2312,
No! Instances of type
On Apr 1, 7:22 pm, Chris Rebert c...@rebertia.com wrote:
On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu eryan...@gmail.com wrote:
Mister Yu, 01.04.2010 13:38:
I'm still not very sure how to convert a unicode object
u'\xd6\xd0\xce\xc4' back to 中文, the string it is supposed to be.
You are confused. '\xd6\xd0\xce\xc4' is an encoded byte string, not a
unicode string. The fact that you have it stored in a unicode string
On Apr 1, 8:13 pm, Chris Rebert c...@rebertia.com wrote:
Mister Yu, 01.04.2010 14:26:
On Apr 1, 8:13 pm, Chris Rebert wrote:
gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8') #as you wanted
Simplifying this hack a bit:
gb2312_bytes =
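For anyone reading this thread on Python 3, here is a minimal sketch of the same round trip. It is not necessarily the simplification proposed above, just one common way to do it. It assumes the text was wrongly decoded byte-for-byte (as Latin-1) instead of as GB2312; the helper name `recover_mojibake` is made up for illustration.

```python
# Python 3 sketch; `recover_mojibake` is a hypothetical helper name.
def recover_mojibake(mangled: str, real_encoding: str = "gb2312") -> str:
    """Undo a wrong byte-for-byte decode and decode with the real encoding."""
    # Latin-1 maps code points 0-255 one-to-one onto byte values 0-255,
    # so this recovers the original raw bytes (assumes all chars are <= U+00FF).
    raw_bytes = mangled.encode("latin-1")
    # Decode those bytes with the encoding the website actually used.
    return raw_bytes.decode(real_encoding)


mangled = "\xd6\xd0\xce\xc4"        # what the crawler ended up with
text = recover_mojibake(mangled)    # proper text string
utf8_bytes = text.encode("utf-8")   # UTF-8 bytes, as originally requested

print(text == "\u4e2d\u6587")       # True
print(utf8_bytes)                   # b'\xe4\xb8\xad\xe6\x96\x87'
```

The cleaner fix is usually to decode the raw response bytes as GB2312 in the first place, so the mangled string never exists.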
On Apr 1, 9:31 pm, Stefan Behnel stefan...@behnel.de wrote: