Jeff Shell wrote:
I continue to feel like an idiot in the face of Unicode. I finally
understand what a unicode 'string' really is, and what encode and
decode mean (they were previously interchangeable in my mind). But I
don't know the best practices.

My desire is to:

- Not have any encode / decode errors. 'ascii codec doesn't recognize
character ... at position ...'. I don't want to keep on bullying
through whenever this pops up.

You can't simply call str(some_unicode) or unicode(some_str) unless you know for certain that you're dealing only with the ASCII subset in both cases. Use explicit encodings to convert.
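To make that concrete, here is a minimal sketch of explicit conversion. It uses Python 3 spellings, where the byte string is `bytes` and the unicode string is `str`; the Python 2 types discussed here have the same `decode` and `encode` methods. The sample value and the assumption that it is Latin-1 are made up for illustration.

```python
# Bytes as they might arrive from a form; we ASSUME they are Latin-1.
raw = b"caf\xe9"

# Decoding with the ASCII codec fails on the non-ASCII byte,
# which is the kind of error quoted above.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the encoding we actually know about works.
text = raw.decode("latin-1")

# Encoding is the reverse step, again with an explicit codec.
utf8_bytes = text.encode("utf-8")
```

The point is only that both directions name their codec; nothing is left to a default.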

Now, the trick is obviously to know the encoding. A 'str' object is worth squat if you don't know the encoding that goes along with it. In other words, (some_str, encoding) is isomorphic to a unicode object.
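A small sketch of why the encoding has to travel with the bytes: the same two bytes yield different text depending on which codec is assumed. Again in Python 3 spellings (`bytes` for the byte string); the byte values are an arbitrary example.

```python
# Two bytes with no encoding attached.
raw = b"\xc3\xa9"

# Read as UTF-8 they are one character (e with acute accent).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1 they are two characters (A with tilde, copyright sign).
as_latin1 = raw.decode("latin-1")

# A (bytes, encoding) pair converts to text and back without loss.
pair = (raw, "utf-8")
text = pair[0].decode(pair[1])
round_trip = text.encode(pair[1])
```

Neither decode raises an error, so nothing but outside knowledge tells you which reading is the intended one.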

- Not turn customer input into garbage. It may render to the public
site fine, but sometimes in the admin skin's text areas, things turn
funky. I don't know if there's something I need to do at form-handling
time, or at rendering time, or what... I did a test based on a
document by Sam Ruby, and guess that I'm often getting Latin-1 from
our customers, which doesn't map to UTF-8 (the diacritic marks go

 - HOW do I know what a browser has sent me? There doesn't seem to be
a real way of handling this. Do I guess?

That's sorta what zope.publisher does. Actually, it figures that if the browser sends an Accept-Charset header, the stuff that it's sending to us would be encoded in one of those encodings, so it tries the ones in Accept-Charset until it's lucky. It falls back to UTF-8.

This seems to work. But yeah, it's relying on implementation details of the browser and it's weird.
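The guessing described above can be sketched roughly like this. It is not zope.publisher's actual code: the function name `guess_decode`, the simplified header parsing (it ignores q-values and keeps the header's order), and the final lossy fallback are all assumptions made for illustration.

```python
import codecs


def guess_decode(raw, accept_charset):
    """Hypothetical helper: try each charset in an Accept-Charset value.

    Returns (text, charset_used). UTF-8 is tried last as the fallback.
    """
    candidates = []
    for item in accept_charset.split(","):
        name = item.split(";")[0].strip()  # drop any ";q=..." parameter
        if name and name != "*":
            candidates.append(name)
    candidates.append("utf-8")

    for name in candidates:
        try:
            codecs.lookup(name)  # skip charset names Python does not know
        except LookupError:
            continue
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue

    # Assumption: if even UTF-8 fails, decode lossily instead of raising.
    return raw.decode("utf-8", errors="replace"), "utf-8"


text, used = guess_decode(b"caf\xe9", "utf-8, iso-8859-1;q=0.5")
```

Note that Latin-1 can decode any byte sequence, so if it comes first in the list it always "succeeds", whether or not the browser really sent Latin-1. That is part of why this remains a guess.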

- Know without a doubt when to encode, and when to decode. I guess the
"proper" thing to do is to store everything as unicode, and to decode
to unicode as early as possible when input is coming in.

Absolutely correct.

But again,
how do I know when to decode from latin-1 and when to decode from
UTF-8? When or why should I encode to one or the other at response
time? Should I worry at all?

If you're using Zope, you don't have to encode outgoing text at all, unless you're setting a non-text content-type on the outgoing response. If the content-type is text/*, you can just return unicode from your browser view and zope.publisher will use the best encoding that the browser prefers (from Accept-Charset). "Best" meaning that if the browser accepts latin-1,utf-8 and your page contains Korean text, it'll use utf-8, not latin-1. utf-8 is always a fallback, anyway, so encoding can never fail.
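The "best encoding" idea can be illustrated with a short sketch. This is not how zope.publisher is implemented; `choose_encoding` is a hypothetical helper that simply takes the first accepted charset able to represent the whole text and otherwise falls back to UTF-8.

```python
def choose_encoding(text, accepted):
    """Hypothetical helper: return (charset, encoded_bytes).

    Tries each accepted charset in order, then UTF-8 as the fallback.
    """
    for name in list(accepted) + ["utf-8"]:
        try:
            return name, text.encode(name)
        except (UnicodeEncodeError, LookupError):
            # Charset cannot represent the text, or Python does not know it.
            continue
    # Only reachable for text UTF-8 itself rejects (e.g. lone surrogates).
    raise ValueError("text cannot be encoded")


# Korean text does not fit in Latin-1, so UTF-8 is chosen.
korean_charset, korean_body = choose_encoding("\ud55c\uae00", ["iso-8859-1", "utf-8"])

# Western European text fits in Latin-1, so the first preference wins.
latin_charset, latin_body = choose_encoding("caf\u00e9", ["iso-8859-1", "utf-8"])
```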

You can, of course, encode yourself in the browser view. You can pick pretty much any encoding you like; all you have to do is tell the browser about it in the response header (Content-Type: foo/bar;charset=your-encoding).
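If you do encode by hand, the one thing to get right is that the charset in the header matches the codec used for the body. A framework-neutral sketch follows; the `render` function and the plain dict of headers are hypothetical and stand in for whatever response object your view actually has.

```python
def render(text, charset="utf-8", content_type="text/html"):
    """Hypothetical helper: encode text and build matching headers."""
    body = text.encode(charset)
    headers = {
        # The charset named here must be the one used to encode the body.
        "Content-Type": "%s;charset=%s" % (content_type, charset),
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = render("caf\u00e9", charset="iso-8859-1")
```

Deriving both the header and the body from the same `charset` argument keeps the two from drifting apart.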

If there are any documents, web pages, Zope 3 book chapters, and past
messages that I may have missed or need to look at in more detail,
please let me know. I've had a hard time sifting through all of the
information, and I apologize if I've missed something written by
anyone here.

I'm wondering if I make this clear enough in my book. It's always hard to tell by myself since these things seem obvious to me. If you have any constructive feedback regarding this, I'll be more than happy to hear it and consequently improve the book for you "Stupid Americans" :).

