[Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)
Hi Bjorn,

I always believed that Unicode and UTF-8 were the same encoding, but reading your message makes me think I was wrong. Can you tell me what the difference is between Unicode and UTF-8?

Bjorn Stabell wrote:

While we're all waiting for Zope 3 and Plone 3, I'd like to know what the standard practice is for using Unicode with Zope 2. In particular, we'd like to store all text as Unicode in the ZODB, and have Zope do the encoding/decoding as automatically and transparently as possible.

We've been using Zope 2's ZPublisher to do this encoding/decoding for over 2 years, and it's working fine. We just have to ensure that we set the appropriate encoding in an HTTP Content-Type header, and that we add :utext/ustring:ENCODING to HTML form field names. Regardless of what you may have heard, THIS WORKS FINE! We also store Unicode, not UTF-8 (or other encodings), strings in the ZODB.

The problems we're running into are with other components, basically making our Unicode-with-Zope experience, shall we say, less than ecstatic. (To put it this way, I seem to lose hair much faster when dealing with Unicode problems :)

I'm wondering why components/products aren't all relying on the ZPublisher for Unicode encoding/decoding? Is there another standard way?
Here is a summary of what we've found:

ZMI:
* gets the charset from manage_page_charset
* relies on ZPublisher for encoding (but doesn't do decoding, see below)
* in PropertyManager you can add ustrings, but since it doesn't add :ENCODING to the field names, you get a Unicode error when trying to save, because it tries to decode the text assuming ASCII (big problem)
* DTML Methods/Documents: don't support Unicode (annoying)
* can't use Unicode ids (not a big problem)

Archetypes:
* gets the charset from portal_url.getCharset() or portal_properties.site_properties.default_charset
* doesn't rely on ZPublisher, does its own encoding/decoding
* returns encoded strings, not Unicode strings, to Zope apps, leading to problems such as:
  - SearchableText() encodes, and as such can't be used with Unicode-aware ZCatalogs
  - transform() encodes (and because of that SearchableText() sometimes decodes/encodes 2 times instead of 0 times)
  - get()ing field values will encode them, so if you want Unicode, you have to decode yourself (adding both unnecessary overhead for data access, and an unnecessary dependency on the global variable for the charset)

Plone:
* no special Unicode support for HTML forms; relies on Archetypes

Formulator:
* gets the charset from manage_page_charset (same as ZMI), but can be overridden
* stores field values as encoded text (not Unicode), but lets you specify which encoding to use (confusingly calls this unicode mode)
* messages are stored as UTF-8 (hardcoded)

I suggest this way of dealing with Unicode right now in Zope 2:

(1) Let ZPublisher do the encoding/decoding of form input and HTML output:
    a. Always set a character encoding in an HTTP Content-Type header
    b. Always append :ustring/utext/ulines/utokens:ENCODING to field names of fields that support Unicode (we may need some library code to make this easier)

(2) Store Unicode strings directly in the ZODB.
The ZODB is perfectly capable of storing strings in Python's internal Unicode format; no need to encode the text to UTF-8 or some other encoding.

(3) Encode/decode yourself when reading from/writing to other external data sources such as files and other databases. Do it just before you write, or just after you read, so that as much code as possible can be encoding-agnostic. Keep the encoding/decoding as close to the source data as possible. The best way to do it is (in most cases) to specify the encoding on the IO stream, and let Python do the encoding/decoding for you transparently. If possible, get the encoding from the external data source (e.g., the file) instead of relying on a magical global variable. If you have to rely on a global variable, let it be manage_page_charset.

(4) [This is really just advice...] Resist patching your code to work with components that don't deal with Unicode. Others are likely having the same problem, so to avoid ending up with lots of ugly patches (that are the source of mysterious Unicode problems), fix the problem at its source: the other component. It's really not that difficult to fix (if we agree on how it should be fixed ;)

None of the above components handles Unicode in this way, but it seems to be how the Unicode support in Zope 2 was meant to be used. Let me know if there is another, better way, but please do let me know... I think we need to resolve this once and for all, or I know some people that'll just go mad (or bald, or both) :)

I'll be willing to contribute patches, but since this applies to so many products, it would be good to get some consensus first. At the very least, can we create a Standard Unicode Practices page?

Bye,
-- David Convent
___ Zope-Dev maillist - [EMAIL PROTECTED]
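Point (3) of the proposal, encoding and decoding on the IO stream, might look roughly like the sketch below. It is written for present-day Python 3 so that it runs as-is; that is an assumption, since the thread concerns Python 2, where the text type was called unicode. The helper names save_text and load_text, the file name, and the choice of ISO-8859-1 are all illustrative and do not come from Zope or any product mentioned here.

```python
import io
import os
import tempfile


def save_text(path, text, encoding):
    """Write text to path; the stream encodes it on the way out."""
    with io.open(path, "w", encoding=encoding) as stream:
        stream.write(text)


def load_text(path, encoding):
    """Read text from path; the stream decodes it on the way in."""
    with io.open(path, "r", encoding=encoding) as stream:
        return stream.read()


# Application code only ever handles text, never encoded bytes.
title = u"Bj\u00f8rn's caf\u00e9"
path = os.path.join(tempfile.mkdtemp(), "title.txt")  # hypothetical file
save_text(path, title, "iso-8859-1")
round_tripped = load_text(path, "iso-8859-1")

# On disk the data is Latin-1 bytes: one byte per character here.
with io.open(path, "rb") as raw:
    raw_bytes = raw.read()
```

The encoding is named once, where the stream is opened, so everything between load_text and save_text can stay encoding-agnostic.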
[Zope-dev] RE: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)
--On Monday, 26 April 2004 10:53 +0200 David Convent [EMAIL PROTECTED] wrote:

I always believed that unicode and utf-8 were the same encoding, but reading your message makes me think I was wrong. Can you tell me what the difference is between unicode and utf-8?

Andreas Jung wrote:

Unicode is a common database for almost all characters. UTF-8 is an *encoding* that allows you to represent any element of this character database as a sequence of 1, 2, 3 or 4 bytes. There are also other encodings, such as UTF-16, that encode an element in a different way, so we are talking about completely different things.

Yes, the difference is that Python has a whole different understanding of Unicode strings (type(u"")) than it has of text in some character encoding (e.g., UTF-8, GB18030, ISO-8859-1, ASCII, stored as type("")). Python will of course represent these unicode strings internally in some way (maybe as 16-bit integers?), but we don't need to know what that is like. All we need to know is that this is a string that can contain any character on the planet, and that we can reasonably expect normal text operations to work on it.

UTF-8 is, similar to ISO-8859-1 (latin1), just a character encoding. It (and UTF-16, UCS-2, UCS-4) is only special in that it was issued by the Unicode consortium and can encode any Unicode character (UCS-2 only those in the Basic Multilingual Plane), whereas ISO-8859-1 (for example), being only 8 bits, can only encode characters used in Western Europe. GB18030, to take another extreme, is a variable-width encoding (1, 2 or 4 bytes per character) endorsed by the Chinese government; it can encode/represent any Unicode character, including non-Chinese ones, which makes it similar in function to UTF-8 or UCS-4.
Internally, we want to work with Unicode strings (where str[4] is the 5th character) instead of UTF-8 encoded text strings (where str[4], being the 5th byte, has little semantic meaning).

Bye,
-- Bjorn
___ Zope-Dev maillist - [EMAIL PROTECTED] http://mail.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
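The distinction between indexing characters and indexing bytes can be seen in a few lines. This is a sketch in Python 3 (an assumption; in the Python 2 of this thread the two types were unicode and str, today they are str and bytes), and the sample word is made up for illustration.

```python
text = u"caf\u00e9s"             # "cafés": 5 characters
encoded = text.encode("utf-8")   # the same text as 6 UTF-8 bytes

fifth_char = text[4]             # index 4 is the 5th character, "s"
fifth_byte = encoded[4:5]        # index 4 is the 5th byte, the tail of "é"

# A lone continuation byte is not text in its own right.
try:
    fifth_byte.decode("utf-8")
    fifth_byte_is_text = True
except UnicodeDecodeError:
    fifth_byte_is_text = False

# One string, several encodings, several byte lengths.
sizes = {
    name: len(text.encode(name))
    for name in ("iso-8859-1", "utf-8", "utf-16-le", "utf-32-le")
}
```

The character count stays at 5 whatever the encoding; only the byte count changes, which is why text operations are better done on the decoded string.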
[Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)
Bjorn Stabell wrote:

Formulator:
* gets charset from manage_page_charset (same as ZMI), but can be overridden
* stores field values as encoded text (not Unicode), but lets you specify which encoding to use (confusingly calls this unicode mode)
* messages are stored as UTF-8 (hardcoded)

While there is no question that Formulator's user interface is confusing where unicode is concerned, most of this is not correct (unless there are bugs I don't know about).

Formulator has two modes: unicode mode and 'classic mode'. In unicode mode, all strings are stored as Python unicode strings. In classic mode, all strings are stored in 'whatever encoding the user is using'. It's possible to convert from one mode to the other, and for this switching behavior an encoding to use can be specified. In unicode mode, that encoding is ignored, however. Classic mode basically exists so as not to break all Formulator forms already in existence. This complicated the design significantly, but I thought this was important.

Quite independently from this, fields can also be configured to *deliver* unicode upon validation/conversion. The character set of the page that the form is in can be specified in the form settings.

I suggest this way of dealing with Unicode right now in Zope 2:

General note: this way sounds good to me, but I know from hard experience how difficult it is to convert an existing application fully to unicode.

(1) Let ZPublisher do the encoding/decoding of form input and HTML output:
    a. Always set a character encoding in an HTTP Content-Type header

Silva does this (and Formulator too).

    b. Always append :ustring/utext/ulines/utokens:ENCODING to field names of fields that support Unicode (we may need some library code to make this easier)

Formulator won't be able to do 'b' very easily. It'll do its own converting to unicode though for fields that want this.

(2) Store Unicode strings directly in the ZODB.
The ZODB is perfectly capable of storing strings in Python's internal Unicode format; no need to encode the text to UTF-8 or some other encoding.

Silva has been doing this fully since version 0.9.2, released in the summer of last year. Formulator took a while longer to catch up (before, it would only interoperate if the form titles etc. were ASCII-only), but is now a first-class citizen in a Zope/unicode environment. Its XML serialization is UTF-8 in this mode.

(3) Encode/decode yourself when reading from/writing to other external data sources such as files and other databases. Do it just before you write, or just after you read, so that as much code as possible can be encoding-agnostic. Keep the encoding/decoding as close to the source data as possible. The best way to do it is (in most cases) to specify the encoding on the IO stream, and let Python do the encoding/decoding for you transparently. If possible, get the encoding from the external data source (e.g., the file) instead of relying on a magical global variable. If you have to rely on a global variable, let it be manage_page_charset.

(4) [This is really just advice...] Resist patching your code to work with components that don't deal with Unicode. Others are likely having the same problem, so to avoid ending up with lots of ugly patches (that are the source of mysterious Unicode problems), fix the problem at its source: the other component. It's really not that difficult to fix (if we agree on how it should be fixed ;)

It's actually quite difficult to fix if you care about backwards compatibility. Fixing Formulator was quite complicated. You're definitely making this sound far easier than it is. It's a good thing to do, and Silva has it, but the words 'not that difficult' don't fit in this debate.

None of the above components handles Unicode in this way, but it seems to be how the Unicode support in Zope 2 was meant to be used.

You're actually wrong about Formulator.
:)

Regards,
Martijn
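Item (1)b of the quoted proposal notes that some library code might make the field-name suffixes easier to manage. A minimal sketch of such a helper follows. The function unicode_field_name and the constant UNICODE_CONVERTERS are hypothetical names invented here, not part of ZPublisher, and the suffix order simply copies the notation used in the proposal (:ustring:ENCODING); whether ZPublisher expects exactly that order is not confirmed in this thread.

```python
import codecs

# Converter names as listed in the proposal; a hypothetical constant.
UNICODE_CONVERTERS = ("ustring", "utext", "ulines", "utokens")


def unicode_field_name(name, converter="ustring", encoding="utf-8"):
    """Build a form field name such as 'title:ustring:utf-8'.

    Hypothetical helper: it only assembles the string and checks
    its inputs; it does not talk to Zope at all.
    """
    if converter not in UNICODE_CONVERTERS:
        raise ValueError("not a Unicode converter: %r" % (converter,))
    codecs.lookup(encoding)  # raises LookupError for unknown encodings
    return "%s:%s:%s" % (name, converter, encoding)


title_field = unicode_field_name("title")
body_field = unicode_field_name("body", "utext", "iso-8859-1")
```

A template would then emit title_field as the name attribute of the input element, keeping the encoding choice in one place instead of in every form.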
Re: [Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)
David Convent wrote:

Hi Bjorn, I always believed that unicode and utf-8 were the same encoding, but reading your message makes me think I was wrong. Can you tell me what the difference is between unicode and utf-8?

Unicode should not be seen as an encoding as such. While Python internally uses an encoding for unicode strings (the strings that Python shows with a 'u' in front when it prints their representation), you shouldn't care about what that is, and Python can in fact be recompiled to use another.

UTF-8 is one particular way to represent unicode data, in this case as 8-bit strings. UTF-8 happens to be popular for two (related) reasons:

* since UTF-8 includes ASCII, ASCII is automatically UTF-8, and UTF-8 without a lot of special characters looks like ASCII.
* software that can deal with 8-bit strings can usually deal with UTF-8.

Anyway, in my experience most programmers have only a vague grasp of encoding issues. The basics in Python are not that hard to understand, but:

* Python is not very educational if you do it wrong; you basically get weird errors.
* you frequently get those weird errors in a different place in the code than where you made the mistake, when some code is trying to combine unicode strings with classic strings.
* you can 'hack' your way around it and survive for a long time. You don't notice the problem as it works with the test text, which happens to be ASCII.

Etc.

Regards,
Martijn
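The "works with ASCII test text, breaks later" failure described above can be reproduced in a few lines. This is a sketch for Python 3, which is an assumption: Python 2 performed the ASCII decode implicitly when unicode and classic strings were combined, whereas Python 3 refuses to mix the two types, so the explicit decode("ascii") below stands in for that implicit step. The function naive_concat is a made-up name for illustration.

```python
def naive_concat(prefix, raw):
    """Join text with a byte string, assuming the bytes are ASCII.

    Mimics what Python 2 did implicitly when a unicode string
    was combined with a classic (byte) string.
    """
    return prefix + raw.decode("ascii")


ascii_bytes = u"hello".encode("utf-8")
accented_bytes = u"h\u00e9llo".encode("utf-8")

# ASCII-only text encodes to the same bytes in ASCII and UTF-8.
ascii_is_utf8 = ascii_bytes == u"hello".encode("ascii")

# The ASCII test data passes without complaint...
works_in_tests = naive_concat(u"Title: ", ascii_bytes)

# ...and the first accented character blows up, far from the cause.
try:
    naive_concat(u"Title: ", accented_bytes)
    failed_on_real_data = False
except UnicodeDecodeError:
    failed_on_real_data = True

# Decoding with the real encoding at the boundary avoids the problem.
fixed = u"Title: " + accented_bytes.decode("utf-8")
```

Test data containing at least one non-ASCII character tends to surface this kind of bug much earlier.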
[Zope-dev] RE: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)
None of the above components handles Unicode in this way, but it seems to be how the Unicode support in Zope 2 was meant to be used.

Martijn wrote:

You're actually wrong about Formulator. :)

Apologies. We were using older versions of Formulator before, and I was just doing code inspection of the new version when I concluded the above about Formulator. One less component to worry about :)

Bye,
-- Bjorn