Encode exception for chinese text
Hi all,

I am new to Python. I have written a small application that reads data from an XML file and tries to encode the data using the appropriate charset. I am having a problem encoding one Chinese paragraph with the charset gb2312.

The code is:

    encoded_str = str_data.encode('gb2312')

The type of str_data is <type 'unicode'>.

The exception is:

    UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in position 0: illegal multibyte sequence

Can anyone please point me in the right direction to solve this issue?

Regards,
Vinayakc
--
http://mail.python.org/mailman/listinfo/python-list
Re: Encode exception for chinese text
Are you sure all the characters in the original text are in the gb2312 charset? Encoding with utf8 seems to work for this character (u'\xa0'), but I don't know if the result is correct.

Could you post a subset of str_data in unicode?
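One way to answer that question is to test each character on its own. The following is a minimal sketch, not code from the thread: it assumes Python 3 (where `str` is the Unicode type), and both the helper name `unencodable_chars` and the sample string are made up for illustration.

```python
def unencodable_chars(text, encoding="gb2312"):
    """Return (index, character) pairs that the given codec cannot encode."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


# Hypothetical sample: a no-break space followed by two Chinese characters.
sample = "\u00a0\u4e2d\u6587"
print(unencodable_chars(sample))
```

Running it over the real str_data would show whether the no-break space at position 0 is the only offender or just the first of several.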
Re: Encode exception for chinese text
Vinayakc wrote:
> Hi all,
>
> I am new to Python. I have written a small application that reads data from an XML file and tries to encode the data using the appropriate charset. I am having a problem encoding one Chinese paragraph with the charset gb2312.
>
> The code is:
>
>     encoded_str = str_data.encode('gb2312')
>
> The type of str_data is <type 'unicode'>.
>
> The exception is:
>
>     UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in position 0: illegal multibyte sequence

Hmm, this is 'no-break space' at the very beginning of the text. It looks suspiciously like a plain-text utf-8 signature, which is 'zero width no-break space'. If you strip the first character, do you still get encoding errors?
Re: Encode exception for chinese text
Yes Serge, I removed the first character, but it still raises the encoding exception.
Re: Encode exception for chinese text
1. *By definition*, you can encode *any* Unicode string into utf-8. Proves nothing.

2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the later gbk (alias cp936). It does have an equivalent in the latest Chinese encoding, gb18030.

3. gb2312 is outdated. It is not really an appropriate charset for anything much these days. You need to check out what your requirements really are. The unknowing will cheerfully use "gb" to mean one or more of those, or to mean anything that's not big5 :-)

4. The slab of text you supplied is genuine unicode and encodes happily into all those gb* charsets. It does *not* contain \u00a0.

I do hope some of this helps.

Cheers,
John
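Points 1 and 2 can be checked directly against Python's bundled codecs. This is a small sketch under the assumption of Python 3; the helper `can_encode` and the `results` dictionary are illustrative names, not part of any library.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, the character from the traceback


def can_encode(text, encoding):
    """Return True if `text` encodes cleanly with the named codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Try the codecs named in the thread, plus utf-8 for comparison.
results = {
    name: can_encode(NBSP, name)
    for name in ("gb2312", "gbk", "cp936", "gb18030", "utf-8")
}
for name, ok in results.items():
    print(name, "ok" if ok else "cannot encode U+00A0")
```

With the codecs shipped in CPython, gb2312, gbk and cp936 reject the character, while gb18030 and utf-8 accept it.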
Re: Encode exception for chinese text
Vinayakc wrote:
> Yes Serge, I removed the first character, but it still raises the encoding exception.

Then I guess this character was used as a poor man's indentation tool, at least at the beginning of your text. It's up to you to decide what to do with that character. You have several choices:

* edit the source XML file to get rid of it
* remove it while you process your data
* replace it with an ordinary space
* consider utf-8

Note that there are legitimate use cases for no-break space. For example, one million can be written as 1 000 000, where the spaces are non-breakable. This prevents the number from being broken at the right margin like this:

    1 000
    000

Keep that in mind when you remove or replace no-break spaces.
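The three in-code choices from that list can be sketched as follows. This assumes Python 3, and the sample string and variable names are hypothetical; it is only an illustration of the options, not the poster's actual fix.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE

# Hypothetical sample: a no-break space followed by two Chinese characters.
sample = NBSP + "\u4e2d\u6587"

# Choice: replace the no-break space with an ordinary space, then encode.
replaced = sample.replace(NBSP, " ").encode("gb2312")

# Choice: remove the no-break space entirely, then encode.
removed = sample.replace(NBSP, "").encode("gb2312")

# Choice: keep the text untouched and use utf-8 instead of gb2312.
as_utf8 = sample.encode("utf-8")

print(replaced.decode("gb2312") == " \u4e2d\u6587")
print(removed.decode("gb2312") == "\u4e2d\u6587")
print(as_utf8.decode("utf-8") == sample)
```

Replacing or removing changes the text, so the number-formatting caveat above applies; switching to utf-8 preserves the character but changes what downstream consumers must decode.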
Re: Encode exception for chinese text
Hey Serge, John,

Thank you very much. I was really not aware of these facts. Anyway, this happens for only one record in millions, so I can ignore it for now.

Thanks again,
Vinayakc
Re: Encode exception for chinese text
John Machin wrote:
> 1. *By definition*, you can encode *any* Unicode string into utf-8. Proves nothing.
>
> 2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the later gbk (alias cp936). It does have an equivalent in the latest Chinese encoding, gb18030.

Also, *by definition*, though :-)

For those who have not followed encodings too closely: gb18030 is to gb2312 what UTF-8 is to ASCII. Both encode all of Unicode in an algorithmic way, and provide byte-for-byte identical encodings for their respective subsets.

Regards,
Martin
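The analogy can be spot-checked on a couple of strings. This is only a sketch assuming Python 3 and its bundled codecs; the sample strings are arbitrary, and the check covers these samples rather than every character of either subset.

```python
ascii_text = "plain ASCII"
chinese_text = "\u4e2d\u6587"  # two characters that gb2312 covers

# UTF-8 reproduces the ASCII bytes for ASCII-only text.
utf8_matches_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# gb18030 reproduces the gb2312 bytes for this gb2312-only sample.
gb18030_matches_gb2312 = (
    chinese_text.encode("gb18030") == chinese_text.encode("gb2312")
)

# Outside the gb2312 subset, gb18030 falls back to a longer byte sequence.
nbsp_bytes = "\u00a0".encode("gb18030")

print(utf8_matches_ascii, gb18030_matches_gb2312, len(nbsp_bytes))
```

For these samples both comparisons hold, and the no-break space round-trips through gb18030 even though gb2312 cannot represent it.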
Re: Encode exception for chinese text
MvL wrote:
> Also, *by definition*, though :-)

Ah yes, indeed; and thanks for reminding me.

Aside: Similar definition, but not similar design: IMHO utf-8 sits on top of ASCII like a rose on a stalk, whereas gb18030 sits on top of gb2312 like a rhinoceros on a unicycle :-)

Cheers,
John