Bjorn Stabell wrote:
--On Monday, 26 April 2004 10:53 +0200 David Convent <[EMAIL PROTECTED]> wrote:


I always believed that Unicode and UTF-8 were the same encoding, but your message makes me think I was wrong.
Can you tell me what the difference is between Unicode and UTF-8?


Andreas Jung wrote:


Unicode is a common database of almost all characters. UTF-8 is an *encoding* that lets you represent any element of this character database as a sequence of 1, 2, 3 or 4 bytes. There are also other encodings, such as UTF-16, that encode an element in a different way, so we are talking about completely different things.
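
As a rough illustration of the point above, here is a minimal sketch that prints how many bytes a few characters need in UTF-8 and in UTF-16. It assumes Python 3 (where str is what this thread calls a unicode string); the helper name encoded_lengths and the sample characters are made up for the example.

```python
# Assumes Python 3: str holds code points, bytes holds an encoding of them.
# Sample characters: "A", a-umlaut, the euro sign, and one outside the BMP.
samples = ["A", "\u00e4", "\u20ac", "\U0001F600"]


def encoded_lengths(ch):
    """Return (code point, UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" is used so that no byte-order mark is added to the count.
    return (ord(ch), len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))


for ch in samples:
    code_point, n_utf8, n_utf16 = encoded_lengths(ch)
    print("U+%04X  utf-8: %d byte(s)  utf-16: %d byte(s)" % (code_point, n_utf8, n_utf16))
```

The code point is the same in every case; only the byte sequence chosen by the encoding changes.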


Yes, the difference is that Python treats Unicode strings (type(u""))
quite differently from text in some character encoding (e.g., UTF-8,
GB18030, ISO-8859-1, ASCII, stored as type("")).  Python has to represent
unicode strings internally somehow (as 16-bit or 32-bit code units,
depending on how the interpreter was built), but we don't need to know the
details.  All we need to know is that such a string can contain any
character on the planet, and that we can reasonably expect normal text
operations to work on it.

UTF-8 is, like ISO-8859-1 (latin1), just a character encoding.  It (like
UTF-16 and UCS-4) is special only in that it is defined as part of the
Unicode / ISO 10646 standards and can encode any Unicode character; UCS-2
is limited to the 65,536 code points of the Basic Multilingual Plane.
ISO-8859-1, by contrast, uses a single byte per character and so can only
encode 256 characters, essentially those used in Western Europe.  GB18030,
to take another extreme, is a variable-width encoding (1, 2 or 4 bytes per
character) mandated by the Chinese government; it defines a mapping for
every Unicode code point, not just the Chinese ones, so in function it is
comparable to UTF-8 or UCS-4.
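
A small sketch of what "can encode" means in practice, again assuming Python 3 and its standard codecs; the helper try_encode and the sample text are invented for the illustration.

```python
# Assumes Python 3. The sample mixes German and Chinese characters.
text = "Gr\u00fc\u00dfe, \u4e16\u754c"


def try_encode(s, codec):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return None


for codec in ("ascii", "iso-8859-1", "gb18030", "utf-8"):
    data = try_encode(text, codec)
    if data is None:
        print("%-10s cannot represent this text" % codec)
    else:
        print("%-10s %d bytes" % (codec, len(data)))
```

ISO-8859-1 copes with the German part on its own but not with the Chinese characters, whereas GB18030 and UTF-8 handle the whole string.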

Internally, we want to work with Unicode strings (where str[4] is the
fifth character) instead of UTF-8 encoded byte strings (where str[4], being
the fifth byte, has little semantic meaning).
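
To make the indexing point concrete, here is a minimal sketch assuming Python 3, where str is indexed by character and bytes by byte. The variable names and the helper decodes_cleanly are hypothetical.

```python
# Assumes Python 3. The text is a-umlaut, u-umlaut, o-umlaut.
text = "\u00e4\u00fc\u00f6"
data = text.encode("utf-8")

print(len(text), len(data))   # number of characters vs. number of bytes
print(ascii(text[1]))         # the second character, as a whole
print(hex(data[1]))           # the second byte: the tail half of the first character


def decodes_cleanly(chunk):
    """Report whether a byte slice is valid UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Slicing bytes can cut a character in half; slicing text cannot.
print(decodes_cleanly(data[:2]), decodes_cleanly(data[:3]))
```

Slicing the byte string at an arbitrary offset can leave a fragment that no longer decodes, which is why text operations belong on the decoded string.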

To illustrate this by way of an example, consider this Python
session (copied from a recent posting on plone.devel but included
here again for the record):

<begin:quote>
This is a common misunderstanding when it comes to
Unicode in Python.

Consider

string1 = u"This is a unicode string"

string2 = string1.encode('utf-8')

Here, type(string1) is unicode whereas type(string2) is str,
i.e., string1 is a proper Python unicode string object whereas
string2 is an ordinary Python string object holding UTF-8 encoded bytes.

Or consider the following Python session:

[EMAIL PROTECTED] ritz]$ python
Python 2.3.3 (#5, Dec 30 2003, 15:25:24)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s1 = u"äüö"
>>> print repr(s1)
u'\xc3\xa4\xc3\xbc\xc3\xb6'
>>> s2 = s1.encode('utf-8')
>>> print repr(s2)
'\xc3\x83\xc2\xa4\xc3\x83\xc2\xbc\xc3\x83\xc2\xb6'
>>> type(s1)
<type 'unicode'>
>>> type(s2)
<type 'str'>
>>>


Maybe that clarifies things a bit? <end:quote>
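
One thing worth noting about the session above: the first repr shows s1 holding six characters rather than three. That looks like the terminal sending UTF-8 bytes and the interpreter reading each byte as one character, though this is an inference from the reprs and not something the posting states. The sketch below assumes Python 3 and reproduces the same byte values step by step; all variable names are made up.

```python
# Assumes Python 3. Start from the three characters whose UTF-8 bytes
# appear in the session's first repr.
intended = "\u00e4\u00fc\u00f6"
utf8 = intended.encode("utf-8")        # the proper encoding: six bytes

# Assumption: each UTF-8 byte was read as one Latin-1 character.
misread = utf8.decode("iso-8859-1")    # six characters instead of three
double = misread.encode("utf-8")       # encoding again doubles up the bytes

print(type(intended).__name__, type(utf8).__name__)
print(ascii(misread))
print(double)

# The clean round trip: decoding with the right codec restores the text.
print(utf8.decode("utf-8") == intended)
```

The type distinction the session demonstrates still holds; the extra bytes come from the input step, not from encode() itself.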

Raphael


Bye,



_______________________________________________
Zope-Dev maillist - [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev
** No cross posts or HTML encoding! **
(Related lists - http://mail.zope.org/mailman/listinfo/zope-announce
http://mail.zope.org/mailman/listinfo/zope )
