While we're all waiting for Zope 3 and Plone 3, I'd like to know what the "standard practice" way of using Unicode with Zope 2 is. In particular, we'd like to store all text as Unicode in the ZODB, and have Zope do the encoding/decoding as automatically and transparently as possible.

We've been using Zope 2's ZPublisher to do this encoding/decoding for over 2 years, and it's working fine. We just have to ensure that we set the appropriate encoding in an HTTP Content-type header, and that we add :utext/ustring:ENCODING to HTML form field names. Regardless of what you may have heard, THIS WORKS FINE! We also store Unicode strings, not UTF-8 (or otherwise encoded) strings, in the ZODB.

The problems we're running into are with other components, which make our Unicode-with-Zope experience, shall we say, less than ecstatic. (To put it this way, I seem to lose hair much faster when dealing with Unicode problems :)

I'm wondering why components/products aren't all relying on the ZPublisher for Unicode encoding/decoding? Is there another standard way?

Here is a summary of what we've found:

ZMI:
* gets charset from manage_page_charset
* relies on ZPublisher for encoding (but doesn't do decoding, see below)
* in PropertyManager you can add ustrings, but since it doesn't add :ENCODING to the field names, you get a Unicode error when trying to save, because it tries to decode the text assuming ASCII (big problem)
* DTML Methods/Documents: don't support Unicode (annoying)
* can't use Unicode ids (not a big problem)

Archetypes:
* gets charset from portal_url.getCharset() or portal_properties.site_properties.default_charset
* doesn't rely on ZPublisher, does its own encoding/decoding
* returns encoded strings, not Unicode strings, to Zope apps, leading to problems such as:
  - SearchableText() encodes, and as such can't be used with Unicode-aware ZCatalogs
  - transform() encodes (and because of that SearchableText() sometimes decodes/encodes 2 times instead of 0 times)
  - get()ing field values will encode them, so if you want Unicode, you have to decode yourself (adding both unnecessary overhead for data access, and an unnecessary dependency on the global variable for the charset)

Plone:
* no special Unicode support for HTML forms; relies on Archetypes

Formulator:
* gets charset from manage_page_charset (same as ZMI), but can be overridden
* stores field values as encoded text (not Unicode), but lets you specify which encoding to use (confusingly calls this "unicode" mode)
* messages are stored as UTF-8 (hardcoded)

I suggest this way of dealing with Unicode right now in Zope 2:

(1) Let ZPublisher do the encoding/decoding of form input and HTML output:
    a. Always set a character encoding in an HTTP Content-type header
    b. Always append :ustring/utext/ulines/utokens:ENCODING to field names of fields that support Unicode (we may need some library code to make this easier)

(2) Store Unicode strings directly in the ZODB. The ZODB is perfectly capable of storing strings in Python's internal Unicode format; no need to encode the text to UTF-8 or some other encoding.

(3) Encode/decode yourself when reading from/writing to other external data sources such as files and other databases. Do it just before you write, or just after you read, so that as much code as possible can be encoding-agnostic. Keep the encoding/decoding as close to the "source data" as possible. The best way to do it is (in most cases) to specify the encoding on the IO stream, and let Python do the encoding/decoding for you transparently. If possible, get the encoding from the external data source (e.g., the file) instead of relying on a magical global variable. If you have to rely on a global variable, let it be manage_page_charset.

(4) [This is really just advice...] Resist patching your code to work with components that don't deal with Unicode. Others are likely having the same problem, so to avoid ending up with lots of ugly patches (that are the source of mysterious Unicode problems), fix the problem at its source: the other component. It's really not that difficult to fix (if we agree on how it should be fixed ;)

None of the above components handles Unicode in this way, but it seems to be how the Unicode support in Zope 2 was meant to be used.
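
As a rough illustration of point (3), here is a minimal sketch of putting the encoding on the IO stream so that everything else only ever sees Unicode. It assumes a Python recent enough to provide io.open (older interpreters have codecs.open for the same job). The helper names read_text and write_text, the sample string, and the choice of Latin-1 are made up for the example; they don't come from Zope or from any of the products above.

```python
import io
import os
import tempfile


def read_text(path, encoding):
    """Return the file's contents as Unicode; the stream does the decoding."""
    with io.open(path, "r", encoding=encoding) as stream:
        return stream.read()


def write_text(path, text, encoding):
    """Write Unicode text; the stream encodes it just before the write."""
    with io.open(path, "w", encoding=encoding) as stream:
        stream.write(text)


# Hypothetical round trip: Latin-1 on disk, Unicode everywhere else.
title = u"Bj\xf8rn's caf\xe9"

handle, path = tempfile.mkstemp(suffix=".txt")
os.close(handle)
try:
    write_text(path, title, "iso-8859-1")

    # What the external data source actually holds: encoded bytes.
    with io.open(path, "rb") as raw:
        on_disk = raw.read()

    # What the application gets back: a Unicode string.
    in_memory = read_text(path, "iso-8859-1")
finally:
    os.remove(path)
```

The encoding is passed in as an argument rather than read from a global, which is the shape you'd want if the encoding can be discovered from the data source itself; falling back to manage_page_charset would then be the caller's decision.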

Let me know if there is another better way, but please do let me know... I think we need to resolve this once and for all or I know some people that'll just go mad (or bald, or both) :)

I'll be willing to contribute patches, but since this applies to so many products, it would be good to get some consensus first. At the very least, can we create a "Standard Unicode Practices" page?

Bye,
--
Bjorn Stabell <mailto:[EMAIL PROTECTED]>
_______________________________________________
Zope-Dev maillist - [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev