Re: [Demexp-dev] Character encoding
Hello Lyu,

2007/10/22, Lyu Abe [EMAIL PROTECTED]:
> There's one thing I do not understand in character coding of the
> server's reply. When I display, for example, tag sets, I can read this:
> 'a_tag_label': u'citoyennet\xe9'
> in which u'citoyennet\xe9' corresponds to a Unicode encoded text, right?

Yes.

> Then I do not understand why we get Unicode encoded strings, while
> DEMEXP is supposed to have UTF-8 encoding...

UTF-8 is the byte-oriented encoding form of Unicode.
http://www.unicode.org/faq/utf_bom.html#2

In other words, all strings on the server are stored in UTF-8, a byte encoding of Unicode. All exchanges between the server and the clients are done in UTF-8, a byte convention to represent Unicode characters. After that, each platform is free to do any appropriate conversion, e.g. use a 16-bit or 32-bit character encoding if it wishes.

However, you should take care to set the default Python encoding to UTF-8 when you communicate with the server. To be honest, right now the server does not really check this encoding. The encoding mainly comes from the GTK2 interface, which produces UTF-8 strings. :-) But that check should be done at some point.

Best wishes,
d.

___
Demexp-dev mailing list
Demexp-dev@nongnu.org
http://lists.nongnu.org/mailman/listinfo/demexp-dev
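The split David describes, UTF-8 on the wire and a platform-specific representation in memory, can be sketched as follows. This is only a minimal illustration, assuming Python 3 (where the u'...' strings of this thread are plain `str` and byte strings are `bytes`); `wire_bytes` is a hypothetical stand-in for data received from the server, not actual demexp output.

```python
# Hypothetical bytes as they might arrive from the server (UTF-8 on the wire).
wire_bytes = b"citoyennet\xc3\xa9"

# Decode once at the boundary: UTF-8 bytes -> Unicode text.
label = wire_bytes.decode("utf-8")

# The text has 11 characters, while the UTF-8 form needs 12 bytes,
# because the final character takes two bytes in UTF-8.
print(ascii(label), len(label), len(wire_bytes))

# Encode again only when sending text back to the server.
outgoing = label.encode("utf-8")
```

Decoding at the boundary and encoding only on the way out keeps the rest of the client independent of the wire encoding.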
Re: [Demexp-dev] Character encoding
Hi,

On Mon, 22 Oct 2007 14:40:46 +0900, Lyu Abe [EMAIL PROTECTED] wrote:
> There's one thing I do not understand in character coding of the
> server's reply. When I display, for example, tag sets, I can read this:
> 'a_tag_label': u'citoyennet\xe9'
> in which u'citoyennet\xe9' corresponds to a Unicode encoded text, right?
> Then I do not understand why we get Unicode encoded strings, while
> DEMEXP is supposed to have UTF-8 encoding...

The string you mention is encoded in ISO-8859-1 (or ISO-8859-15): the special character é is encoded on one byte only, so it's not UTF-8.

You're also confusing Unicode and UTF-8. Unicode associates each character with a unique number, and UTF-8 encodes that number in a certain way. There are various ways of encoding Unicode numbers (UTF-7, UTF-8, UTF-16, UTF-32, UCS-2, etc.).

See http://en.wikipedia.org/wiki/Unicode for more information.

Sincerely,

Thomas
-- 
Thomas Petazzoni - [EMAIL PROTECTED]
http://{thomas,sos,kos}.enix.org - http://www.toulibre.org
http://www.{livret,agenda}dulibre.org
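The distinction between the Unicode number of a character and the ways of encoding that number can be illustrated with a short sketch. It assumes Python 3; the encoding names are the standard codec names shipped with Python, and the choice of which ones to show is arbitrary.

```python
ch = "\u00e9"  # the character é

# The Unicode code point is a number, independent of any byte encoding.
code_point = ord(ch)
print(hex(code_point))

# The same code point serialised by several encodings.
encodings = ["utf-8", "utf-16-be", "utf-32-be", "latin-1"]
encoded = {name: ch.encode(name) for name in encodings}
for name, data in encoded.items():
    print(name, data.hex(), len(data), "byte(s)")
```

Only the Latin-1 form is the single byte E9; UTF-8 needs two bytes for the same character.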
Re: [Demexp-dev] Character encoding
Hi Thomas and David,

Thanks for the clarification!

Lyu.
Re: [Demexp-dev] Character encoding
Hello Thomas,

2007/10/22, Thomas Petazzoni [EMAIL PROTECTED]:
> The string you mention is encoded in ISO-8859-1 (or ISO-8859-15): the
> special character é is encoded on one byte only, so it's not UTF-8.

I'm not sure of that. If you look at the Unicode table for Latin1 (http://www.unicode.org/charts/PDF/U0080.pdf), the encoding of é is 00E9 (p. 7). As the string is explicitly marked as a Unicode string (u'string') in Python, I would say that this is indeed a Unicode string, with the é shown in hexadecimal.

Yours,
d.
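David's reading, that \xe9 inside a Unicode string literal is an escaped code point rather than a raw byte, can be checked with a small sketch. It assumes Python 3, where every `str` is a Unicode string; `ascii()` is used here to get an escaped display similar to what Python 2 printed for u'...' values.

```python
import unicodedata

s = "citoyennet\xe9"

# In a text-string literal, \xe9 denotes code point U+00E9, not a byte.
same = (s == "citoyennet\u00e9")

last = s[-1]
name = unicodedata.name(last)
print(name)

# Escaped display of the string, with é shown in hexadecimal.
escaped = ascii(s)
print(escaped)
```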
Re: [Demexp-dev] Character encoding
Hi,

On Mon, 22 Oct 2007 09:18:23 +0200, David MENTRE [EMAIL PROTECTED] wrote:
> I'm not sure of that. If you look at the Unicode table for Latin1
> (http://www.unicode.org/charts/PDF/U0080.pdf), the encoding of é is
> 00E9 (p. 7).

I'm not sure either :-)

On a system with LANG=fr_FR, I run a Python interpreter:

>>> s = u"citoyennet\xe9"
>>> s
u'citoyennet\xe9'
>>> print s
citoyenneté              <- It is displayed correctly.
>>> s.encode('utf-8')
'citoyennet\xc3\xa9'

And here we have the string encoded in UTF-8.

>>> print s.encode('utf-8')
citoyennetÃ©             <- It is not displayed correctly.

But even with that, I'm still not sure I understand completely. These encoding issues are really tough to grasp.

Sincerely,

Thomas
-- 
Thomas Petazzoni - [EMAIL PROTECTED]
http://{thomas,sos,kos}.enix.org - http://www.toulibre.org
http://www.{livret,agenda}dulibre.org
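The behaviour in this session can be reproduced without depending on the terminal. The sketch below assumes Python 3 and models the LANG=fr_FR terminal as one that interprets incoming bytes as Latin-1 (ISO-8859-1); that modelling is an assumption, not something stated in the message.

```python
text = "citoyennet\u00e9"

# UTF-8 bytes, as produced by s.encode('utf-8') in the session.
utf8_bytes = text.encode("utf-8")

# What a Latin-1 terminal shows when it receives those UTF-8 bytes:
# each of the two bytes of é is rendered as a separate character.
misread = utf8_bytes.decode("latin-1")
print(misread)

# What the same terminal shows when the text is encoded for it.
latin1_bytes = text.encode("latin-1")
print(latin1_bytes.decode("latin-1"))
```

The Unicode string itself is fine in both cases; only the mismatch between the bytes sent and the encoding the terminal expects produces the garbled output.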
Re: [Demexp-dev] Character encoding
Hi Thomas,

2007/10/22, Thomas Petazzoni [EMAIL PROTECTED]:
> But even with that, I'm still not sure I understand completely. These
> encoding issues are really tough to grasp.

Yep, I agree. I only hope we don't have an encoding mess in the official database. I'll need to check that. One more thing to check.

Yours,
d.
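One way such a check could look is sketched below. It assumes Python 3 and that the stored strings are available as raw bytes; `is_valid_utf8` and the sample `records` are hypothetical and are not taken from the demexp code or database.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample records, not real database content.
records = [
    b"citoyennet\xc3\xa9",  # é as two UTF-8 bytes
    b"citoyennet\xe9",      # é as one Latin-1 byte, invalid as UTF-8
    b"plain ascii",         # ASCII is always valid UTF-8
]

suspect = [r for r in records if not is_valid_utf8(r)]
print(suspect)
```

A check like this only finds byte strings that are not valid UTF-8; text that was encoded twice still decodes cleanly and would need a separate heuristic.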
[Demexp-dev] Character encoding
David,

There's one thing I do not understand in character coding of the server's reply. When I display, for example, tag sets, I can read this:

'a_tag_label': u'citoyennet\xe9'

in which u'citoyennet\xe9' corresponds to a Unicode encoded text, right?

Then I do not understand why we get Unicode encoded strings, while DEMEXP is supposed to have UTF-8 encoding...

Thanks.

Lyu