Encode exception for chinese text

2006-05-19 Thread Vinayakc
Hi all,

I am new to Python.

I have written a small application that reads data from an XML file and
tries to encode it using the appropriate charset.
I am having a problem encoding one Chinese paragraph with the charset
gb2312.

The code is:

encoded_str = str_data.encode('gb2312')

str_data is of type 'unicode'.

The exception is:

UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in
position 0: illegal multibyte sequence

Can anyone please point me in the right direction to solve this issue?

Regards,
Vinayakc

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encode exception for chinese text

2006-05-19 Thread swordsp
Are you sure all the characters in the original text are in the gb2312
charset?

Encoding with utf8 seems to work for this character (u'\xa0'), but I
don't know if the result is correct.

Could you post a subset of str_data in unicode?
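
One way to answer that question without posting the whole text is to list the characters the codec rejects. The following is only a sketch, written for Python 3 (where `str` plays the role of the `unicode` type in the thread); the helper name `find_unencodable` and the sample string are made up for illustration.

```python
import unicodedata

def find_unencodable(text, encoding="gb2312"):
    """Return (position, character, Unicode name) for each character the codec rejects."""
    problems = []
    for pos, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            problems.append((pos, ch, unicodedata.name(ch, "UNKNOWN")))
    return problems

# Hypothetical data: a no-break space, two Chinese characters, then ASCII.
sample = "\u00a0\u4e2d\u6587 text"
for pos, ch, name in find_unencodable(sample):
    print(pos, repr(ch), name)
```

Checking one character at a time is slow on large inputs, but it reports every offender rather than only the first one the exception mentions.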



Re: Encode exception for chinese text

2006-05-19 Thread Serge Orlov
Vinayakc wrote:
> Hi all,
>
> I am new to Python.
>
> I have written a small application that reads data from an XML file and
> tries to encode it using the appropriate charset.
> I am having a problem encoding one Chinese paragraph with the charset
> gb2312.
>
> The code is:
>
> encoded_str = str_data.encode('gb2312')
>
> str_data is of type 'unicode'.
>
> The exception is:
>
> UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa0' in
> position 0: illegal multibyte sequence

Hmm, this is a 'no-break space' at the very beginning of the text. It
looks suspiciously like a plain-text utf-8 signature, which is a 'zero
width no-break space'. If you strip the first character, do you still
have encoding errors?
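
The two characters can be told apart by code point: a UTF-8 signature decodes to U+FEFF (zero width no-break space), whereas the traceback reports U+00A0 (no-break space). A small Python 3 sketch of that check follows; the helper name `describe_first_char` and the sample byte strings are hypothetical.

```python
import unicodedata

def describe_first_char(text):
    """Return the code point and Unicode name of the first character, or None if empty."""
    if not text:
        return None
    ch = text[0]
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "UNKNOWN"))

# A UTF-8 signature survives as U+FEFF when the bytes are decoded with 'utf-8' ...
with_signature = b"\xef\xbb\xbfabc".decode("utf-8")
print(describe_first_char(with_signature))

# ... and is dropped when the bytes are decoded with 'utf-8-sig'.
without_signature = b"\xef\xbb\xbfabc".decode("utf-8-sig")
print(describe_first_char(without_signature))

# The character named in the traceback is a different one.
print(describe_first_char("\u00a0abc"))
```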



Re: Encode exception for chinese text

2006-05-19 Thread Vinayakc
Yes Serge, I have removed the first character, but it still raises an
encoding exception.



Re: Encode exception for chinese text

2006-05-19 Thread John Machin
1. *By definition*, you can encode *any* Unicode string into utf-8, so
the fact that it works proves nothing.
2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
later gbk (alias cp936). It does have an equivalent in the latest
Chinese encoding, gb18030.
3. gb2312 is outdated. It is not really an appropriate charset for
anything much these days. You need to check what your requirements
really are. The unknowing will cheerfully use "gb" to mean one or more
of those, or to mean anything that's not big5 :-)
4. The slab of text you supplied is genuine unicode and encodes happily
into all those gb* charsets. It does *not* contain \u00a0.
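
Points 1 and 2 can be checked directly against Python's bundled codecs. This is a minimal Python 3 sketch; `can_encode` is a hypothetical helper, not a standard-library function.

```python
NBSP = "\u00a0"  # no-break space, the character from the traceback

def can_encode(text, encoding):
    """True if every character of text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for codec in ("gb2312", "gbk", "gb18030", "utf-8"):
    print(codec, can_encode(NBSP, codec))
```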

I do hope some of this helps 

Cheers,
John


Re: Encode exception for chinese text

2006-05-19 Thread Serge Orlov
Vinayakc wrote:
> Yes serge, I have removed the first character but it is still giving
> encoding exception.

Then I guess this character was used as a poor man's indentation tool,
at least at the beginning of your text. It's up to you to decide what to
do with that character; you have several choices:

* edit source xml file to get rid of it
* remove it while you process your data
* replace it with ordinary space
* consider utf-8

Note that there are legitimate use cases for no-break space. For
example, one million can be written as 1 000 000, where the spaces are
non-breaking. This prevents the number from being broken at the right
margin like this: 1 000
000

Keep that in mind when you remove or replace no-break spaces.
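
The "remove", "replace" and "consider utf-8" choices listed above might look like this in Python 3. It is only a sketch: the helper names and the sample string are invented for illustration, and which option is right depends on what the no-break spaces mean in your data.

```python
NBSP = "\u00a0"

def replace_nbsp(text):
    """Swap each no-break space for an ordinary space (lines may then break there)."""
    return text.replace(NBSP, " ")

def remove_nbsp(text):
    """Drop no-break spaces entirely (this also joins digit groups such as 1 000 000)."""
    return text.replace(NBSP, "")

# Hypothetical data: leading NBSP, two Chinese characters, a number grouped with NBSPs.
data = NBSP + "\u4e2d\u6587 1" + NBSP + "000" + NBSP + "000"

print(replace_nbsp(data).encode("gb2312"))  # encodes once the NBSPs are gone
print(remove_nbsp(data).encode("gb2312"))
print(data.encode("utf-8"))                 # utf-8 keeps the NBSPs as they are
```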



Re: Encode exception for chinese text

2006-05-19 Thread Vinayakc
Hey Serge, John,

Thank you very much. I was really not aware of these facts. Anyway,
this happens for only one in millions, so I can ignore it for
now.

Thanks again,

Vinayakc



Re: Encode exception for chinese text

2006-05-19 Thread Martin v. Löwis
John Machin wrote:
> 1. *By definition*, you can encode *any* Unicode string into utf-8.
> Proves nothing.
> 2. \u00a0 [no-break space] has no equivalent in gb2312, nor in the
> later gbk alias cp936. It does have an equivalent in the latest Chinese
> encoding, gb18030.

Also *by definition*, though :-) For those who have not followed
encodings too closely: gb18030 is to gb2312 what UTF-8 is to ASCII.
Both encode all of Unicode in an algorithmic way, and both produce
byte-for-byte identical encodings for their respective
subsets.
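
The subset relationship can be observed with Python's bundled codecs. This is a Python 3 sketch under the assumption that a few sample strings are representative; `same_bytes` and `round_trips` are hypothetical helpers.

```python
def same_bytes(text, subset_codec, full_codec):
    """True if both codecs produce identical bytes for text."""
    return text.encode(subset_codec) == text.encode(full_codec)

def round_trips(text, codec):
    """True if text survives encoding followed by decoding with the codec."""
    return text.encode(codec).decode(codec) == text

chinese = "\u4e2d\u6587"  # sample characters that gb2312 covers
print(same_bytes(chinese, "gb2312", "gb18030"))
print(same_bytes("plain ASCII", "ascii", "utf-8"))

# Characters outside the smaller sets still encode in the larger ones.
for ch in ("\u00a0", "\U0001F600"):
    print(repr(ch), round_trips(ch, "gb18030"), round_trips(ch, "utf-8"))
```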

Regards,
Martin


Re: Encode exception for chinese text

2006-05-19 Thread John Machin
MvL wrote:
> Also, *by definition*, though :-)

Ah yes, indeed; and thanks for reminding me. Aside: Similar definition,
but not similar design: IMHO utf-8 sits on top of ASCII like a rose on
a stalk, whereas gb18030 sits on top of gb2312 like a rhinoceros on a
unicycle :-)
Cheers,
John
