Re: [Demexp-dev] Character encoding

2007-10-22 Thread David MENTRE
Hello Lyu,

2007/10/22, Lyu Abe [EMAIL PROTECTED]:
> There's one thing I do not understand about the character encoding of the
> server's reply. When I display, for example, tag sets, I can read this:
>
> 'a_tag_label': u'citoyennet\xe9'
>
> in which u'citoyennet\xe9' corresponds to a Unicode-encoded text,
> right?

Yes.

> Then I do not understand why we get Unicode-encoded strings,
> while DEMEXP is supposed to use the UTF-8 encoding...

UTF-8 is the byte-oriented encoding form of Unicode.
http://www.unicode.org/faq/utf_bom.html#2

In other words, all strings on the server are stored in UTF-8, a byte
encoding of Unicode. All exchanges between the server and the clients
are also done in UTF-8, a byte-level convention for representing Unicode
characters.

After that, each platform is free to do any appropriate conversion,
e.g. to use a 16- or 32-bit character encoding if it wishes. However, you
should take care to set the default Python encoding to UTF-8 when you
talk to the server.
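
A minimal sketch of that round trip, assuming a current Python 3
interpreter. The variable names and the sample bytes are illustrative
and are not taken from the demexp client:

```python
# Bytes as they would travel between server and client (UTF-8).
wire_bytes = b'citoyennet\xc3\xa9'

# Decode once at the boundary to get a Unicode string.
label = wire_bytes.decode('utf-8')
print(ascii(label))       # 'citoyennet\xe9'
print(len(label))         # 11 characters

# Encode again before sending anything back to the server.
reply_bytes = label.encode('utf-8')
print(len(reply_bytes))   # 12 bytes: the e-acute takes two bytes in UTF-8
```

Decoding and encoding explicitly at the boundary avoids depending on
whatever default encoding the interpreter happens to use.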

To be honest, right now the server does not really check this encoding.
The UTF-8 mostly comes from the GTK2 interface, which produces UTF-8
strings. :-) But such a check should be added at some point.
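
To illustrate what such a check could look like, here is a sketch in
Python. It is not the server's code, and the helper name is_valid_utf8
is made up for this example:

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b'citoyennet\xc3\xa9'))  # True: valid UTF-8
print(is_valid_utf8(b'citoyennet\xe9'))      # False: a lone 0xE9 byte is not UTF-8
```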

Best wishes,
d.


___
Demexp-dev mailing list
Demexp-dev@nongnu.org
http://lists.nongnu.org/mailman/listinfo/demexp-dev


Re: [Demexp-dev] Character encoding

2007-10-22 Thread Thomas Petazzoni
Hi,

On Mon, 22 Oct 2007 14:40:46 +0900,
Lyu Abe [EMAIL PROTECTED] wrote:

> There's one thing I do not understand about the character encoding of
> the server's reply. When I display, for example, tag sets, I can read
> this:
>
> 'a_tag_label': u'citoyennet\xe9'
>
> in which u'citoyennet\xe9' corresponds to a Unicode-encoded
> text, right? Then I do not understand why we get Unicode-encoded
> strings, while DEMEXP is supposed to use the UTF-8 encoding...

The string you mention is encoded in ISO-8859-1 (or ISO-8859-15): the
special character é is encoded on one byte only, so it's not UTF-8.

You're also confusing Unicode with UTF-8. Unicode associates each
character with a unique number, and UTF-8 is one way to encode that
number as bytes. There are various ways of encoding Unicode numbers
(UTF-7, UTF-8, UTF-16, UTF-32, UCS-2, etc.).

See http://en.wikipedia.org/wiki/Unicode for more information.
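
As a small illustration of the difference between the number and its
encodings (a sketch assuming Python 3; the names ch and encoded are
only for this example):

```python
ch = u'\xe9'  # the character e-acute

# The Unicode number (code point) of the character.
print(hex(ord(ch)))  # 0xe9

# The same number written out as bytes under several encodings.
encoded = {}
for name in ('latin-1', 'utf-8', 'utf-16-be', 'utf-32-be'):
    encoded[name] = ch.encode(name)
    print(name, encoded[name])
```

The code point stays the same in every case; only the byte sequence
changes from one encoding to the next.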

Sincerely,

Thomas
-- 
Thomas Petazzoni - [EMAIL PROTECTED]
http://{thomas,sos,kos}.enix.org - http://www.toulibre.org
http://www.{livret,agenda}dulibre.org




Re: [Demexp-dev] Character encoding

2007-10-22 Thread Lyu Abe

Hi Thomas and David,

Thanks for the clarification!

Lyu.






Re: [Demexp-dev] Character encoding

2007-10-22 Thread David MENTRE
Hello Thomas,

2007/10/22, Thomas Petazzoni [EMAIL PROTECTED]:
> The string you mention is encoded in ISO-8859-1 (or ISO-8859-15): the
> special character é is encoded on one byte only, so it's not UTF-8.

I'm not sure about that. If you look at the Unicode table for Latin-1
(http://www.unicode.org/charts/PDF/U0080.pdf), the code point of é is
00E9 (p. 7).

As the string is explicitly marked as a Unicode string (u'string') in
Python, I would say that this is indeed a Unicode string, with the é
shown as a hexadecimal escape.
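
A quick way to check this reading, sketched for Python 3 (where the
u prefix is still accepted):

```python
s = u'citoyennet\xe9'

# \xe9 inside a Unicode literal names the code point U+00E9, not a raw byte.
print(ascii(s))                  # 'citoyennet\xe9'
print(len(s))                    # 11 characters
print(hex(ord(s[-1])))           # 0xe9
print(s == u'citoyennet\u00e9')  # True: both escapes denote the same character
```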

Yours,
d.




Re: [Demexp-dev] Character encoding

2007-10-22 Thread Thomas Petazzoni
Hi,

On Mon, 22 Oct 2007 09:18:23 +0200,
David MENTRE [EMAIL PROTECTED] wrote:

> I'm not sure of that. If you look at the Unicode table for Latin1
> (http://www.unicode.org/charts/PDF/U0080.pdf), the encoding of é is
> 00E9 (p. 7).

I'm not sure either :-)

On a system with LANG=fr_FR, I ran a Python interpreter:

>>> s = u"citoyennet\xe9"
>>> s
u'citoyennet\xe9'
>>> print s
citoyenneté

 -> It is displayed correctly.

>>> s.encode('utf-8')
'citoyennet\xc3\xa9'

And here we have the string encoded in UTF-8.

>>> print s.encode('utf-8')
citoyennetÃ©

 -> It is not displayed correctly.

But even with that, I'm still not sure I understand completely. These
encoding issues are really tough to grasp.
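
One possible reading of the session above, assuming the terminal under
LANG=fr_FR interprets output bytes as ISO-8859-1 (that is an assumption
about the setup, not something verified here): the two UTF-8 bytes of é
are each shown as a separate Latin-1 character. A sketch for Python 3:

```python
s = u'citoyennet\xe9'

# The Unicode string written out as UTF-8 bytes.
utf8_bytes = s.encode('utf-8')

# What a Latin-1 terminal makes of those bytes: one character per byte.
misread = utf8_bytes.decode('latin-1')
print(ascii(misread))                        # 'citoyennet\xc3\xa9'
print(misread == u'citoyennet\u00c3\u00a9')  # True: A-tilde, then copyright sign

# Decoding with the right encoding recovers the original string.
print(utf8_bytes.decode('utf-8') == s)       # True
```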

Sincerely,

Thomas
-- 
Thomas Petazzoni - [EMAIL PROTECTED]
http://{thomas,sos,kos}.enix.org - http://www.toulibre.org
http://www.{livret,agenda}dulibre.org




Re: [Demexp-dev] Character encoding

2007-10-22 Thread David MENTRE
Hi Thomas,

2007/10/22, Thomas Petazzoni [EMAIL PROTECTED]:
> But even with that, I'm still not sure I understand completely. These
> encoding issues are really tough to grasp.

Yep, I agree. I only hope we don't have an encoding mess in the
official database. I'll need to check that. One more thing to check.

Yours,
d.




[Demexp-dev] Character encoding

2007-10-21 Thread Lyu Abe

David,

There's one thing I do not understand about the character encoding of the
server's reply. When I display, for example, tag sets, I can read this:

'a_tag_label': u'citoyennet\xe9'

in which u'citoyennet\xe9' corresponds to a Unicode-encoded text,
right? Then I do not understand why we get Unicode-encoded strings,
while DEMEXP is supposed to use the UTF-8 encoding...

Thanks. Lyu


