[Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

2004-04-26 Thread David Convent
Hi Bjorn,

I always believed that Unicode and UTF-8 were the same encoding, but reading 
your message makes me think I was wrong.
Can you tell me what the difference is between Unicode and UTF-8?

Bjorn Stabell wrote:

While we're all waiting for Zope 3 and Plone 3, I'd like to know what the
standard practice is for using Unicode with Zope 2.  In particular, we'd
like to store all text as Unicode in the ZODB, and have Zope do the
encoding/decoding as automatically and transparently as possible.
We've been using Zope 2's ZPublisher to do this encoding/decoding for over 2
years, and it's working fine.  We just have to ensure that we set the
appropriate encoding in an HTTP Content-Type header, and that we add
:utext/ustring:ENCODING to HTML form field names.  Regardless of what you
may have heard, THIS WORKS FINE!  We also store Unicode strings, not UTF-8
(or otherwise encoded) strings, in the ZODB.
The problems we're running into are with other components, which make
our Unicode-with-Zope experience, shall we say, less than ecstatic. (To put
it this way, I seem to lose hair much faster when dealing with Unicode
problems :) )  I'm wondering why components/products aren't all relying on
the ZPublisher for Unicode encoding/decoding?  Is there another standard
way?
Here is a summary of what we've found:

ZMI:
* gets charset from manage_page_charset
* relies on ZPublisher for encoding (but doesn't do decoding, see below)
* in PropertyManager you can add ustrings, but since it doesn't add
:ENCODING to the field names, you get a Unicode error when trying to save,
since it tries to decode the text assuming ASCII (big problem)
* DTML Methods/Documents: don't support Unicode (annoying)
* can't use Unicode ids (not a big problem)

Archetypes:
* gets charset from portal_url.getCharset() or
portal_properties.site_properties.default_charset
* doesn't rely on ZPublisher, does its own encoding/decoding
* returns encoded strings, not Unicode strings, to Zope apps, leading to
problems such as:
 - SearchableText() encodes, and as such can't be used with Unicode-aware
ZCatalogs
 - transform() encodes
   (and because of that SearchableText() sometimes decodes/encodes 2 times
instead of 0 times)
 - get()ing field values will encode them, so if you want Unicode, you have
to decode yourself
   (adding both unnecessary overhead for data access, and an unnecessary
dependency on the global variable for the charset)

Plone:
* no special Unicode support for HTML forms; relies on Archetypes

Formulator:
* gets charset from manage_page_charset (same as ZMI), but can be overridden
* stores field values as encoded text (not Unicode), but lets you specify
which encoding to use
 (confusingly calls this unicode mode)
* messages are stored as UTF-8 (hardcoded)

I suggest this way of dealing with Unicode right now in Zope 2:

(1) Let ZPublisher do the encoding/decoding of form input and HTML output:

 a. Always set a character encoding in an HTTP Content-Type header

 b. Always append :ustring/utext/ulines/utokens:ENCODING to field names of
fields that support Unicode
 (we may need some library code to make this easier)
(2) Store Unicode strings directly in the ZODB.  The ZODB is perfectly
capable of storing strings in Python's internal Unicode format; no need to
encode the text to UTF-8 or some other encoding.
(3) Encode/decode yourself when reading from/ writing to other external data
sources such as files and other databases.  Do it just before you write, or
just after you read, so that as much code as possible can be
encoding-agnostic.  Keep the encoding/decoding as close to the source data
as possible.   The best way to do it is (in most cases) to specify the
encoding on the IO stream, and let Python do the encoding/decoding for you
transparently.  If possible, get the encoding from the external data source
(e.g., the file) instead of relying on a magical global variable.  If you
have to rely on a global variable, let it be manage_page_charset.
(4) [This is really just advice...] Resist patching your code to work with
components that don't deal with Unicode.  Others are likely having the
same problem, so to avoid ending up with lots of ugly patches (that are the
source of mysterious Unicode problems), fix the problem at its source: the
other component.  It's really not that difficult to fix (if we agree on how
it should be fixed ;)
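
Point (3) above, letting the IO stream do the encoding/decoding, can be sketched as follows. This is only an illustration in present-day Python 3, where `str` plays the role of the Python 2 unicode type discussed here; `io.open` with an `encoding` argument is used (at the time of this thread `codecs.open` filled that role). The file name and sample text are made up for the example.

```python
import io
import os
import tempfile

text = "Gr\u00fc\u00dfe aus K\u00f6ln"  # text held in memory, no encoding attached

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# Writing: the stream encodes on the way out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading: the stream decodes on the way in.
with io.open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

# Only the raw bytes on disk show which encoding was chosen.
with io.open(path, "rb") as f:
    raw = f.read()
```

Everything between the two `open` calls can then stay encoding-agnostic, which is the point being made in (3).
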
None of the above components handles Unicode in this way, but it seems to be
how the Unicode support in Zope 2 was meant to be used.  Let me know if
there is another better way, but please do let me know...  I think we need
to resolve this once and for all or I know some people that'll just go mad
(or bald, or both) :)
I'll be willing to contribute patches, but since this applies to so many
products, it would be good to get some consensus first.  At the very least,
can we create a Standard Unicode Practices page?
Bye,
 



--
David Convent
___
Zope-Dev maillist  -  [EMAIL PROTECTED]

[Zope-dev] RE: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

2004-04-26 Thread Bjorn Stabell
 --On Monday, 26 April 2004 10:53 +0200 David Convent 
 [EMAIL PROTECTED] wrote:
 
  I always believed that Unicode and UTF-8 were the same encoding, but 
  reading your message makes me think I was wrong.
  Can you tell me what the difference is between Unicode and UTF-8?

Andreas Jung wrote: 
 Unicode is a common database of almost all characters. UTF-8 
 is an *encoding* that allows you to represent any element of 
 this character database as a sequence of 1, 2, 3 or 4 bytes. There 
 are also other encodings, e.g. UTF-16, that encode an 
 element in a different way, so we are talking about 
 completely different things.
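
To make this concrete, here is a small sketch in present-day Python 3, where `str` corresponds to the Python 2 unicode type and `bytes` to the old 8-bit string. The four sample characters are arbitrary choices for illustration.

```python
# Each sample is a single Unicode character (one code point):
# Latin capital A, e with acute accent, a CJK ideograph, a musical G clef.
samples = ["A", "\u00e9", "\u4e2d", "\U0001d11e"]

for ch in samples:
    utf8 = ch.encode("utf-8")       # 1 to 4 bytes depending on the code point
    utf16 = ch.encode("utf-16-be")  # 2 or 4 bytes (big-endian, no byte order mark)
    print("U+%04X  utf-8: %d bytes  utf-16: %d bytes"
          % (ord(ch), len(utf8), len(utf16)))

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-be")) for ch in samples]
```

The same character yields different byte sequences under each encoding, while the character itself (its code point) stays the same.
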

Yes, the difference is that Python has a whole different understanding of
Unicode strings (type(u"")) than it has of text in some character encoding
(e.g., UTF-8, GB18030, ISO-8859-1, ASCII, stored as type("")).  Python will
of course represent these Unicode strings internally in some way (maybe as
16-bit integers?), but we don't need to know what that is like.  All we need
to know is that this is a string that can contain any character on the
planet, and that we can reasonably expect normal text operations to work on
it.

UTF-8 is, like ISO-8859-1 (latin1), just a character encoding.  It
(and UTF-16 and UCS-4) is only special in that it is defined by the Unicode
standard and can encode any Unicode character (UCS-2 is limited to the
first 65,536 code points).  ISO-8859-1 (for example), being only 8 bits,
can only encode characters used in Western Europe.  GB18030, to take
another extreme, is an encoding endorsed by the Chinese government that
uses 1, 2 or 4 bytes per character; it can encode any Unicode character,
including non-Chinese ones.  In this respect it is similar in function to
UTF-8 and UCS-4.
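
A small sketch of what this means in practice, written in Python 3 and using only codecs that ship with Python's standard library. The sample text is an arbitrary pair of Chinese characters.

```python
text = "\u4e2d\u6587"  # two Chinese characters

# ISO-8859-1 has no byte for these characters, so encoding fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 and GB18030 can both represent them, using different bytes.
as_utf8 = text.encode("utf-8")
as_gb = text.encode("gb18030")

# Decoding with the matching codec gives back the same text.
round_trip_ok = (as_utf8.decode("utf-8") == text
                 and as_gb.decode("gb18030") == text)
```
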

Internally, we want to work with Unicode strings (where str[4] is the 5th
character) instead of UTF-8 encoded text strings (where str[4], being the
5th byte, has little semantic meaning).
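
The difference between indexing text and indexing encoded bytes can be sketched like this (Python 3, where `str` is the Unicode type and `bytes` holds encoded data; the sample string is an arbitrary choice):

```python
# Japanese for "Japanese text": 8 characters, each taking 3 bytes in UTF-8.
text = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"
data = text.encode("utf-8")

char_at_4 = text[4]    # a whole character (katakana 'te', U+30C6)
byte_at_4 = data[4:5]  # one byte from the middle of the second character

# A lone byte cut out of a multi-byte sequence is not valid text.
try:
    byte_at_4.decode("utf-8")
    lone_byte_is_text = True
except UnicodeDecodeError:
    lone_byte_is_text = False
```
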

Bye,
-- 
Bjorn



___
Zope-Dev maillist  -  [EMAIL PROTECTED]
http://mail.zope.org/mailman/listinfo/zope-dev
**  No cross posts or HTML encoding!  **
(Related lists - 
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope )


[Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

2004-04-26 Thread Martijn Faassen
Bjorn Stabell wrote:

Formulator:
* gets charset from manage_page_charset (same as ZMI), but can be overridden
* stores field values as encoded text (not Unicode), but lets you specify
which encoding to use
  (confusingly calls this unicode mode)
* messages are stored as UTF-8 (hardcoded)
While there is no question about the confusingness of the user interface 
of Formulator pertaining to unicode, most of this is not correct (unless 
there are bugs I don't know about).

Formulator has two modes; unicode mode and 'classic mode'. In unicode 
mode, all strings are stored as Python unicode strings. In classic mode, 
all strings are stored in 'whatever encoding the user is using'. It's 
possible to convert from one mode to another, and for this switching 
behavior an encoding to use can be specified. In unicode mode, that 
encoding is ignored, however.

Classic mode basically exists so as not to break all Formulator forms 
already in existence. This complicated the design significantly, but I 
thought this was important.

Quite independently from this, fields can also be configured to 
*deliver* unicode upon validation/conversion. The character set of the 
page that the form is in can be specified in the form settings.

I suggest this way of dealing with Unicode right now in Zope 2:
General note: this way sounds good to me, but I know from hard 
experience how difficult it is to convert an existing application fully 
to unicode.

(1) Let ZPublisher do the encoding/decoding of form input and HTML output:

  a. Always set a character encoding in an HTTP Content-Type header
Silva does this (and Formulator too).

  b. Always append :ustring/utext/ulines/utokens:ENCODING to field names of
fields that support Unicode
  (we may need some library code to make this easier)
Formulator won't be able to do 'b' very easily. It'll do its own 
converting to unicode though for fields that want this.
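
Whoever does the work, the idea behind (1) b is that the encoding travels with the field, so the bytes are decoded exactly once, at the edge. The sketch below is purely illustrative: `decode_field` is a hypothetical helper, and the `name:ustring:ENCODING` layout it parses simply follows how the suffix is written in this thread. It is not ZPublisher's or Formulator's actual code, and it ignores the list handling that ulines/utokens would need.

```python
def decode_field(field_name, raw_value, default_encoding="ascii"):
    """Hypothetical helper: split 'name:ustring:ENCODING' and decode bytes.

    Returns (plain_name, value). Values of fields without a 'u...' type
    marker are returned unchanged as bytes.
    """
    parts = field_name.split(":")
    name, markers = parts[0], parts[1:]
    unicode_types = {"ustring", "utext", "ulines", "utokens"}
    if not unicode_types.intersection(markers):
        return name, raw_value
    # Any marker that is not a type name is taken to be the encoding.
    encodings = [m for m in markers if m not in unicode_types]
    encoding = encodings[0] if encodings else default_encoding
    return name, raw_value.decode(encoding)


name, value = decode_field("title:ustring:utf-8", b"Caf\xc3\xa9")
```

Leaving the encoding off, as in `decode_field("title:ustring", b"Caf\xc3\xa9")`, falls back to ASCII and raises UnicodeDecodeError, which is the same kind of failure described for PropertyManager earlier in the thread.
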

(2) Store Unicode strings directly in the ZODB.  The ZODB is perfectly
capable of storing strings in Python's internal Unicode format; no need to
encode the text to UTF-8 or some other encoding.
Silva has been doing this fully since version 0.9.2, released in the 
summer of last year. Formulator took a while longer to catch up (before 
it would only interoperate if the form titles etc were only ascii), but 
is now a first class citizen in a Zope/unicode environment. Its XML 
serialization is UTF-8 in this mode.
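
Since the ZODB persists objects by pickling them, point (2) can be illustrated with the standard `pickle` module alone. This is a sketch in Python 3, not ZODB code; the `record` dictionary is a made-up stand-in for a persistent object's state.

```python
import pickle

# Made-up object state containing non-ASCII text.
record = {"title": "\u4e2d\u6587 Caf\u00e9", "count": 3}

stored = pickle.dumps(record)    # bytes, as they would be written to storage
restored = pickle.loads(stored)  # text comes back as text, no decode step
```

The application never picks an encoding here; the serialization layer handles the text objects itself.
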

(3) Encode/decode yourself when reading from/ writing to other external data
sources such as files and other databases.  Do it just before you write, or
just after you read, so that as much code as possible can be
encoding-agnostic.  Keep the encoding/decoding as close to the source data
as possible.   The best way to do it is (in most cases) to specify the
encoding on the IO stream, and let Python do the encoding/decoding for you
transparently.  If possible, get the encoding from the external data source
(e.g., the file) instead of relying on a magical global variable.  If you
have to rely on a global variable, let it be manage_page_charset.
(4) [This is really just advice...] Resist patching your code to work with
components that don't deal with Unicode.  Others are likely having the
same problem, so to avoid ending up with lots of ugly patches (that are the
source of mysterious Unicode problems), fix the problem at its source: the
other component.  It's really not that difficult to fix (if we agree on how
it should be fixed ;)
It's actually quite difficult to fix if you care about backwards 
compatibility. Fixing Formulator was quite complicated. You're 
definitely making this sound far easier than it is. It's a good thing to 
do, Silva has it, but the words 'not that difficult' don't fit in this 
debate.

None of the above components handles Unicode in this way, but it seems to be
how the Unicode support in Zope 2 was meant to be used. 
You're actually wrong about Formulator. :)

Regards,

Martijn





Re: [Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

2004-04-26 Thread Martijn Faassen
David Convent wrote:
Hi Bjorn,

I always believed that Unicode and UTF-8 were the same encoding, but reading 
your message makes me think I was wrong.
Can you tell me what the difference is between Unicode and UTF-8?
Unicode should not be seen as an encoding as such. While Python 
internally uses an encoding for unicode strings (the strings that Python 
shows with a 'u' in front when it displays their repr), you 
shouldn't care about what that is, and Python can in fact be recompiled 
to use another.

UTF-8 is one particular way to represent unicode data, in this case as 8 
bit strings. UTF-8 happens to be popular for two (related) reasons:

  * since UTF-8 includes ASCII, ASCII is automatically UTF-8 and UTF-8 
without a lot of special characters looks like ASCII.

  * Software that can deal with 8 bit strings can usually deal with UTF-8.
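
Both reasons can be checked with a short Python 3 sketch; the sample strings are arbitrary.

```python
ascii_text = "plain ASCII text"

# Pure ASCII text yields identical bytes under both codecs.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Byte-oriented operations still work on UTF-8 data, because bytes below
# 0x80 never occur inside a multi-byte sequence.
fields = "caf\u00e9,th\u00e9,cr\u00e8me".encode("utf-8").split(b",")
decoded_fields = [f.decode("utf-8") for f in fields]
```
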

Anyway, in my experience most programmers have only a vague grasp of 
encoding issues. In Python the basics are not that hard to understand, but:

  * Python is not very educational if you do it wrong; you basically 
get weird errors

  * you get weird errors frequently in a different place in the code 
than where you made them, when some code is trying to combine unicode 
strings with classic strings.

  * you can 'hack' your way around it and survive for a long time. You 
don't notice the problem as it works with the test text which happens to 
be ascii. Etc.
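
The last two points can be sketched as follows. In Python 2, combining a unicode string with a classic 8-bit string triggered an implicit ASCII decode of the latter; `join_like_python2` below is a hypothetical helper that mimics that behaviour so the example runs on Python 3, which instead refuses the combination outright.

```python
def join_like_python2(text, data):
    """Mimic Python 2's implicit ASCII decode when text met an 8-bit string."""
    return text + data.decode("ascii")


# ASCII-only test data: everything appears to work.
ok = join_like_python2("Title: ", b"hello")

# Real-world data, encoded as UTF-8 somewhere else in the program:
# the error only surfaces here, where the two kinds of string meet.
try:
    join_like_python2("Title: ", "caf\u00e9".encode("utf-8"))
    failed = False
except UnicodeDecodeError:
    failed = True

# Python 3 raises TypeError immediately, even for ASCII-only data.
python3_refuses = False
try:
    "Title: " + b"hello"
except TypeError:
    python3_refuses = True
```
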

Regards,

Martijn



[Zope-dev] RE: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

2004-04-26 Thread Bjorn Stabell
  None of the above components handles Unicode in this way, 
  but it seems to be how the Unicode support in Zope 2 was meant to be used.

Martijn wrote:
 You're actually wrong about Formulator. :)

Apologies.  We were using older versions of Formulator before, and I was
just doing code inspection of the new version when I concluded the above
about Formulator.  One less component to worry about :)

Bye,
-- 
Bjorn


