Bjorn Stabell wrote:
--On Monday, 26 April 2004 10:53 +0200 David Convent <[EMAIL PROTECTED]> wrote:
I always believed that Unicode and UTF-8 were the same encoding, but reading your message makes me think I was wrong.
Can you tell me what the difference is between Unicode and UTF-8?
Andreas Jung wrote:
Unicode is a common database for almost all characters. UTF-8 is an *encoding* that allows you to represent any element of this character database as a sequence of 1, 2, 3 or 4 bytes. There are also other encodings, e.g. UTF-16, that encode an element in a different way... so we are talking about completely different things.
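To make the "database versus encoding" distinction concrete, here is a small sketch. It is written for Python 3 (an assumption on my part; there `str` plays the role of the `unicode` type discussed in this thread and `bytes` the role of the old byte string). The sample characters and the helper name `describe` are arbitrary choices for illustration.

```python
# Python 3 sketch: str holds Unicode code points, bytes holds encoded data.
# The sample characters below are arbitrary illustrations.
samples = [
    "A",           # LATIN CAPITAL LETTER A
    "\u00e4",      # LATIN SMALL LETTER A WITH DIAERESIS
    "\u20ac",      # EURO SIGN
    "\U0001d11e",  # MUSICAL SYMBOL G CLEF (outside the 16-bit range)
]


def describe(ch):
    """Return (code point, UTF-8 byte length, UTF-16 byte length) for one character."""
    return (ord(ch), len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))


for ch in samples:
    code_point, utf8_len, utf16_len = describe(ch)
    # The code point is the entry in the character database;
    # the byte counts depend entirely on which encoding is chosen.
    print("U+%04X  utf-8: %d byte(s)  utf-16: %d byte(s)"
          % (code_point, utf8_len, utf16_len))
```

The same code point comes out as a different number of bytes under each encoding, which is the sense in which Unicode and UTF-8 are different things.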
Yes, the difference is that Python treats Unicode strings (type(u"")) quite differently from text in some character encoding (e.g., UTF-8, GB18030, ISO-8859-1, ASCII, stored as type("")). Python will of course represent these Unicode strings internally in some way (maybe as 16-bit integers?), but we don't need to know the details. All we need to know is that such a string can contain any character on the planet, and that we can reasonably expect normal text operations to work on it.
UTF-8 is, like ISO-8859-1 (Latin-1), just a character encoding. It (like UTF-16 and UCS-4) is special only in that it is defined alongside the Unicode standard and can encode every Unicode character; UCS-2 is limited to the 16-bit range. ISO-8859-1 (for example), being only 8 bits, can encode just 256 characters, enough for Western European languages. GB18030, to take another extreme, is a variable-width encoding (1, 2 or 4 bytes per character) endorsed by the Chinese government; it defines a mapping for every Unicode code point, including non-Chinese ones, so in function it is similar to UTF-8 or UCS-4.
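A quick way to see which encodings can represent a given piece of text is simply to try them. This is a sketch for Python 3 (my assumption; `str` there corresponds to the `unicode` type above), and `try_encode` is a hypothetical helper name, not part of any library.

```python
# Python 3 sketch. The sample text mixes Western European and Chinese characters.
text = "Gr\u00fc\u00dfe, \u4f60\u597d"  # "Gruesse" with u-umlaut and sharp s, then "ni hao"


def try_encode(s, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent s."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None


for name in ("iso-8859-1", "gb18030", "utf-8", "utf-16-le"):
    data = try_encode(text, name)
    if data is None:
        print("%-11s cannot represent this text" % name)
    else:
        print("%-11s %d bytes" % (name, len(data)))
```

ISO-8859-1 gives up on the Chinese characters, while GB18030 and the UTF encodings handle the whole string.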
Internally, we want to work with Unicode strings (where str[3] is the 4th character) instead of UTF-8 encoded text strings (where str[3], being the 4th byte, has little semantic meaning).
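The indexing point can be sketched as follows, again for Python 3 (an assumption; `str` stands in for the Unicode string and `bytes` for the UTF-8 encoded string). The sample word is arbitrary.

```python
# Python 3 sketch: indexing a Unicode string versus indexing its UTF-8 bytes.
text = "na\u00efve caf\u00e9"   # "naive cafe" with i-diaeresis and e-acute
encoded = text.encode("utf-8")

fourth_char = text[3]        # the 4th character, 'v'
fourth_byte = encoded[3:4]   # the 4th byte: the second half of the i-diaeresis

print(len(text), len(encoded))   # character count versus byte count
print(repr(fourth_char), repr(fourth_byte))

# A lone continuation byte is not valid UTF-8 on its own.
try:
    fourth_byte.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print("4th byte decodable on its own:", decodable)
```

Slicing the encoded form by position can cut a character in half, which is why text operations belong on the Unicode string.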
To illustrate this with an example, consider this Python session (copied from a recent posting on plone.devel, but included here again for the record):
<begin:quote> This is a common misunderstanding when it comes to Unicode in Python.
string1 = u"This is a unicode string"
string2 = string1.encode('utf-8')
Here, type(string1) is unicode whereas type(string2) is str, i.e., string1 is a proper Python unicode string object whereas string2 is a proper Python string object holding UTF-8 encoded bytes.
Or consider the following Python session:
[EMAIL PROTECTED] ritz]$ python
Python 2.3.3 (#5, Dec 30 2003, 15:25:24)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s1 = u"äüö"
>>> print repr(s1)
u'\xc3\xa4\xc3\xbc\xc3\xb6'
>>> s2 = s1.encode('utf-8')
>>> print repr(s2)
'\xc3\x83\xc2\xa4\xc3\x83\xc2\xbc\xc3\x83\xc2\xb6'
>>> type(s1)
<type 'unicode'>
>>> type(s2)
<type 'str'>
>>>
Maybe that clarifies things a bit? <end:quote>
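One note on the byte values in that session: the repr of s1 shows six code units, not three characters. A plausible reading (my assumption, not something the quoted post states) is that the three umlauts were typed into a UTF-8 terminal and each incoming byte was taken as a separate Latin-1 character. The Python 3 sketch below reproduces the same values under that assumption; the variable names are chosen to mirror the session.

```python
# Python 3 sketch, assuming the session's input arrived as UTF-8 bytes
# that were then interpreted one byte per character (Latin-1 style).
intended = "\u00e4\u00fc\u00f6"          # the three umlauts: a, u, o with diaeresis

typed_bytes = intended.encode("utf-8")   # what a UTF-8 terminal would send
s1 = typed_bytes.decode("latin-1")       # each byte becomes one code point
s2 = s1.encode("utf-8")                  # encoding again doubles the damage

print(repr(typed_bytes))
print(len(s1))
print(repr(s2))

# Undoing both steps recovers the intended text.
recovered = s2.decode("utf-8").encode("latin-1").decode("utf-8")
print(recovered == intended)
```

Either way, the type distinction the session is meant to show still holds: s1 is a Unicode string and s2 is its UTF-8 encoded form.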
Zope-Dev maillist - [EMAIL PROTECTED]