Jeff Shell wrote:
I continue to feel like an idiot in the face of Unicode. I finally
understand what a unicode 'string' really is, and what encode and
decode mean (they were previously interchangeable in my mind). But I
don't know the best practices.
My desire is to:
- Not have any encode / decode errors. 'ascii codec doesn't recognize
character ... at position ...'. I don't want to keep on bullying
through whenever this pops up.
You can't just simply do str(some_unicode) or unicode(some_str), unless
you really know that you're only dealing with the ASCII subset in both
cases. Use explicit encodings to convert.
Now, the trick is obviously to know the encoding. A 'str' object is
worth squat if you don't know the encoding that goes along with it. In
other words, (some_str, encoding) is isomorphic to a unicode object.
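To make that concrete, here is a minimal sketch of explicit conversion. It is written in Python 3 syntax, where the 'str' of this thread is spelled bytes and 'unicode' is spelled str; that mapping and the sample value `raw` are assumptions for illustration only.

```python
# Python 3 sketch (assumption): bytes plays the role of the thread's 'str',
# and str plays the role of 'unicode'.

raw = b"caf\xe9"  # hypothetical input, known to be Latin-1 encoded

# Decode with the encoding you know the bytes are in.
text = raw.decode("latin-1")

# Encode explicitly when you need bytes again; the target encoding is your choice.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")

# Wrong guess, case 1: the bytes are not valid in the guessed encoding.
try:
    raw.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

# Wrong guess, case 2: no error at all, just garbage text.
garbage = as_utf8.decode("latin-1")
```

The second wrong guess is the nastier one: nothing raises, and the diacritics simply turn into two junk characters each.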
- Not turn customer input into garbage. It may render to the public
site fine, but sometimes in the admin skin's text areas, things turn
funky. I don't know if there's something I need to do at form-handling
time, or at rendering time, or what... I did a test based on a
document by Sam Ruby, and guess that I'm often getting Latin-1 from
our customers, which doesn't map to UTF-8 (the diacritic marks go
- HOW do I know what a browser has sent me? There doesn't seem to be
a real way of handling this. Do I guess?
That's sorta what zope.publisher does. Actually, it figures that if the
browser sends an Accept-Charset header, the stuff that it's sending to us
would be encoded in one of those encodings, so it tries the ones in
Accept-Charset until it's lucky. It falls back to UTF-8.
This seems to work. But yeah, it's relying on implementation details of
the browser and it's weird.
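The guessing strategy described above can be sketched roughly like this. This is not zope.publisher's actual code; the function names `parse_accept_charset` and `decode_input` are made up for illustration, the charsets are tried in header order (q-values are ignored), and it uses Python 3 bytes/str.

```python
def parse_accept_charset(header):
    """Return the charset names from an Accept-Charset header, in header order.

    Hypothetical helper: drops parameters such as ';q=0.7' and the '*' wildcard.
    """
    charsets = []
    for item in header.split(","):
        name = item.split(";", 1)[0].strip().lower()
        if name and name != "*":
            charsets.append(name)
    return charsets


def decode_input(raw, accept_charset=""):
    """Try each charset the browser accepts until one decodes, else use UTF-8."""
    for charset in parse_accept_charset(accept_charset):
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Bytes are invalid in this charset, or Python doesn't know the name.
            continue
    # Fallback; this can still raise UnicodeDecodeError for non-UTF-8 input.
    return raw.decode("utf-8")
```

One caveat with this kind of guessing: Latin-1 can decode any byte sequence, so if it comes first in the list it will always "succeed", even on UTF-8 input. The order in which the candidates are tried matters.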
- Know without a doubt when to encode, and when to decode. I guess the
"proper" thing to do is to store everything as unicode, and to decode
to unicode as early as possible when input is coming in.
How do I know when to decode from latin-1 and when to decode from
UTF-8? When or why should I encode to one or the other at response
time? Should I worry at all?
If you're using Zope, you don't have to encode outgoing text at all,
unless you're setting a non-text content-type on the outgoing response.
If the content-type is text/*, you can just return unicode from your
browser view and zope.publisher will use the best encoding that the
browser prefers (from Accept-Charset). "Best" meaning that if the
browser accepts latin-1,utf-8 and your page contains Korean text, it'll
use utf-8, not latin-1. utf-8 is always the fallback anyway, so
encoding can never fail.
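The "best encoding" selection could look something like the following sketch. Again, this only illustrates the idea and is not taken from zope.publisher; `encode_for_browser` is a hypothetical name, `accepted` is assumed to be an already-parsed list of charset names, and the code uses Python 3 str/bytes.

```python
def encode_for_browser(text, accepted):
    """Encode text with the first accepted charset that can represent it.

    Returns (encoded_bytes, charset_used). Falls back to UTF-8, which can
    encode any text, so the function always returns a result.
    """
    for charset in accepted:
        try:
            return text.encode(charset), charset
        except (UnicodeEncodeError, LookupError):
            # This charset can't represent the text, or Python doesn't know it.
            continue
    return text.encode("utf-8"), "utf-8"


# Western European text fits into latin-1, so latin-1 is used.
body_1, charset_1 = encode_for_browser("caf\u00e9", ["latin-1", "utf-8"])

# Korean text doesn't fit into latin-1, so the next candidate, utf-8, is used.
body_2, charset_2 = encode_for_browser("\ud55c\uad6d\uc5b4", ["latin-1", "utf-8"])
```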
You can, of course, encode yourself in the browser view. You can pick
pretty much any encoding you like; all you have to do is tell the
browser about it in the response header (Content-Type:
If there are any documents, web pages, Zope 3 book chapters, and past
messages that I may have missed or need to look at in more detail,
please let me know. I've had a hard time sifting through all of the
information, and I apologize if I've missed something written by
I'm wondering if I make this clear enough in my book. It's always hard
to tell by myself since these things seem obvious to me. If you have any
constructive feedback regarding this, I'll be more than happy to hear it
and consequently improve the book for you "Stupid Americans" :).
http://worldcookery.com -- Professional Zope documentation and training
Next Zope 3 training at Camp5: http://trizpug.org/boot-camp/camp5
Zope3-users mailing list