[Zope3-Users] Re: Unicode for Stupid Americans (like me)?

Philipp von Weitershausen Wed, 28 Feb 2007 08:39:58 -0800

Jeff Shell wrote:

I continue to feel like an idiot in the face of Unicode. I finally
understand what a unicode 'string' really is, and what encode and
decode mean (they were previously interchangeable in my mind). But I
don't know the best practices.


My desire is to:

- Not have any encode / decode errors. 'ascii codec doesn't recognize
character ... at position ...'. I don't want to keep on bullying
through whenever this pops up.

You can't simply do str(some_unicode) or unicode(some_str) unless you really know that you're only dealing with the ASCII subset in both cases. Use explicit encodings to convert.

Now, the trick is obviously to know the encoding. A 'str' object is worth squat if you don't know the encoding that goes along with it. In other words, (some_str, encoding) is isomorphic to a unicode object.
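
A minimal sketch of explicit conversion, written in Python 3 spelling as an assumption (there, `bytes` plays the role of the Python 2 'str' discussed here, and `str` plays the role of 'unicode'). The sample bytes and variable names are made up for illustration:

```python
# Bytes as they might arrive from a form; here they happen to be Latin-1.
raw = b"caf\xe9"

# (bytes, encoding) -> text: decode with the encoding you know applies.
text = raw.decode("latin-1")

# text -> bytes: encode with the encoding you choose for output.
utf8 = text.encode("utf-8")

# Decoding with the wrong encoding is an error, not a silent success.
try:
    raw.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```

The same bytes without the "latin-1" label could not be interpreted reliably, which is the point of pairing every byte string with its encoding.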

- Not turn customer input into garbage. It may render to the public
site fine, but sometimes in the admin skin's text areas, things turn
funky. I don't know if there's something I need to do at form-handling
time, or at rendering time, or what... I did a test based on a
document by Sam Ruby, and guess that I'm often getting Latin-1 from
our customers, which doesn't map to UTF-8 (the diacritic marks go
haywire).

 - HOW do I know what a browser has sent me? There doesn't seem to be
a real way of handling this. Do I guess?

That's sorta what zope.publisher does. Actually, it figures that if the browser sends an Accept-Charset header, the stuff that it's sending to us would be encoded in one of those encodings, so it tries the ones in Accept-Charset until it's lucky. It falls back to UTF-8.

This seems to work. But yeah, it's relying on implementation details of the browser and it's weird.
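
The guessing strategy described above can be sketched roughly as follows. This is not zope.publisher's actual code; the function names `parse_accept_charset` and `decode_form_value` are hypothetical, and the sketch uses Python 3 spelling (`bytes` in, `str` out):

```python
def parse_accept_charset(header):
    """Return charset names from an Accept-Charset value, highest q first."""
    entries = []
    for part in header.split(","):
        name, _, params = part.strip().partition(";")
        name = name.strip().lower()
        if not name or name == "*":
            continue  # skip empty entries and the wildcard
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        if q > 0:
            entries.append((q, name))
    # sort() is stable, so header order is kept among equal q values.
    entries.sort(key=lambda entry: -entry[0])
    return [name for _, name in entries]


def decode_form_value(raw, accept_charset):
    """Try each advertised charset in turn; fall back to UTF-8."""
    for charset in parse_accept_charset(accept_charset):
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Last resort; this can still raise if the bytes are not valid UTF-8.
    return raw.decode("utf-8")
```

One caveat of any such approach: Latin-1 can decode every byte sequence, so if it is tried before UTF-8, UTF-8 input comes out as garbage rather than as an error. That is one way diacritics "go haywire".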

- Know without a doubt when to encode, and when to decode. I guess the
"proper" thing to do is to store everything as unicode, and to decode
to unicode as early as possible when input is coming in.


Absolutely correct.

But again,
how do I know when to decode from latin-1 and when to decode from
UTF-8? When or why should I encode to one or the other at response
time? Should I worry at all?

If you're using Zope, you don't have to encode outgoing text at all, unless you're setting a non-text content-type on the outgoing response. If the content-type is text/*, you can just return unicode from your browser view and zope.publisher will use the best encoding that the browser prefers (from Accept-Charset). "Best" meaning that if the browser accepts latin-1,utf-8 and your page contains Korean text, it'll use utf-8, not latin-1. utf-8 is always a fallback, anyway, so encoding can never fail.
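
To illustrate what "best" means here, a rough sketch of the selection logic. It only mirrors the behaviour described above and is not taken from zope.publisher; `choose_encoding` is a hypothetical name, and the charset list is assumed to be already ordered by browser preference:

```python
def choose_encoding(text, accepted):
    """Return the first accepted charset that can represent all of text."""
    for charset in accepted:
        try:
            text.encode(charset)
            return charset
        except (UnicodeEncodeError, LookupError):
            continue  # charset cannot represent the text, or is unknown
    # UTF-8 can encode any ordinary text, so it is a safe fallback.
    return "utf-8"


latin_choice = choose_encoding("caf\u00e9", ["iso-8859-1", "utf-8"])
korean_choice = choose_encoding("\ud55c\uad6d\uc5b4", ["iso-8859-1", "utf-8"])
```

With these inputs the Western European text stays in Latin-1, while the Korean text skips Latin-1 and lands on UTF-8.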

You can, of course, encode yourself in the browser view. You can pick pretty much any encoding you like; all you have to do is tell the browser about it in the response header (Content-Type: foo/bar; charset=your-encoding).
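
A framework-neutral sketch of encoding by hand and announcing the charset. The `render` helper and its header dictionary are hypothetical and do not use the Zope response API; in a real browser view you would set the header through whatever response object your framework provides:

```python
def render(text, encoding="utf-8", media_type="text/html"):
    """Encode text and build headers that name the encoding used."""
    body = text.encode(encoding)
    headers = {
        # The charset parameter must match the encoding applied above.
        "Content-Type": "%s; charset=%s" % (media_type, encoding),
        # Length is counted in bytes after encoding, not in characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = render("caf\u00e9", encoding="iso-8859-1")
```

The important part is that the declared charset and the bytes actually sent agree; a mismatch between the two is a common source of garbled characters.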

If there are any documents, web pages, Zope 3 book chapters, and past
messages that I may have missed or need to look at in more detail,
please let me know. I've had a hard time sifting through all of the
information, and I apologize if I've missed something written by
anyone here.

I'm wondering if I make this clear enough in my book. It's always hard to tell by myself since these things seem obvious to me. If you've got any constructive feedback regarding this, I'll be more than happy to hear it and consequently improve the book for you "Stupid Americans" :).


HTH

--
http://worldcookery.com -- Professional Zope documentation and training
Next Zope 3 training at Camp5: http://trizpug.org/boot-camp/camp5

_______________________________________________
Zope3-users mailing list
Zope3-users@zope.org
http://mail.zope.org/mailman/listinfo/zope3-users
