On 2/28/07, Philipp von Weitershausen <[EMAIL PROTECTED]> wrote:
Jeff Shell wrote:
> - Not have any encode / decode errors. 'ascii codec doesn't recognize
> character ... at position ...'. I don't want to keep on bullying
> through whenever this pops up.
You can't just simply do str(some_unicode) or unicode(some_str), unless
you really know that you're only dealing with the ASCII subset in both
cases. Use explicit encodings to convert.
Now, the trick is obviously to know the encoding. A 'str' object is
worth squat if you don't know the encoding that goes along with it. In
other words, (some_str, encoding) is isomorphic to a unicode object.
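
To make that pairing concrete, here is a minimal sketch. It is written in Python 3 terms, where bytes plays the role of the old 'str' and str plays the role of the old 'unicode'; the sample text is made up for illustration.

```python
# Python 3 naming: bytes ~ the old 'str', str ~ the old 'unicode'.
text = "Internationalization: café"

# Encoding is always explicit, and the resulting bytes differ per encoding.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")

# (bytes, encoding) taken together recover the original text.
assert as_utf8.decode("utf-8") == text
assert as_latin1.decode("latin-1") == text

# The wrong pairing does not always fail loudly: latin-1 accepts any byte,
# so this decode succeeds but yields the wrong characters.
garbled = as_utf8.decode("latin-1")

# ASCII, by contrast, rejects the non-ASCII bytes outright.
try:
    as_utf8.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

The silent case is the more dangerous one: nothing raises, and the garbage only shows up when someone reads the page.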
Ahh. I finally get this now. I was casting back and forth with wild
abandon in some key places - in one particular place I was doing wild
encoding somersaults when I really meant to be doing a small set of
decode tries. I think this is why I was seeing customer garbage: I was
turning unicode into strs and back again long before the final
response was all built up.
> - HOW do I know what a browser has sent me? There doesn't seem to be
> a real way of handling this. Do I guess?
That's sorta what zope.publisher does. Actually, it figures that if the
browser sends an Accept-Charset header, the stuff that it's sending to us
would be encoded in one of those encodings, so it tries the ones in
Accept-Charset until it's lucky. It falls back to UTF-8.
This seems to work. But yeah, it's relying on implementation details of
the browser and it's weird.
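
The guessing strategy described above can be sketched roughly as follows. This is not zope.publisher's actual code; decode_form_value is a hypothetical helper, written in Python 3 terms (bytes in, str out), and it ignores q-values in the header for brevity.

```python
def decode_form_value(raw, accept_charset=None):
    """Try each charset named in Accept-Charset in order, then UTF-8.

    Hypothetical helper for illustration only.
    """
    candidates = []
    if accept_charset:
        for item in accept_charset.split(","):
            # Drop any ";q=..." parameter and surrounding whitespace.
            name = item.split(";")[0].strip()
            if name and name != "*":
                candidates.append(name)
    candidates.append("utf-8")  # the fallback

    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python does not know.
            continue
    # Nothing fit: keep going rather than raise, marking the bad bytes.
    return raw.decode("utf-8", errors="replace")
```

Note that the order of the candidates matters: a single-byte encoding such as latin-1 accepts any byte sequence, so if it is tried first it always "succeeds", even on data that was really UTF-8.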
Ugh. I don't know how I missed that header. I was always looking for a
content-type on the post, hoping that it had the information.
I was finally able to confirm that Zope was handing me the data
properly; it was some of my HTML generation code that was mangling
data on output.
> But again,
> how do I know when to decode from latin-1 and when to decode from
> UTF-8? When or why should I encode to one or the other at response
> time? Should I worry at all?
If you're using Zope, you don't have to encode outgoing text at all,
unless you're setting a non-text content-type on the outgoing response.
If the content-type is text/*, you can just return unicode from your
browser view and zope.publisher will use the best encoding that the
browser prefers (from Accept-Charset). "Best" meaning that if the
browser accepts latin-1,utf-8 and your page contains Korean text, it'll
use utf-8, not latin-1. utf-8 is always a fallback, anyway, so
encoding can never fail.
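
As a rough illustration of that selection (again, not zope.publisher's real implementation), a hypothetical encode_response helper could try the browser's charsets in order and fall back to UTF-8. Python 3 terms: str in, bytes out.

```python
def encode_response(text, accepted):
    """Return (charset, body) using the first accepted charset that can
    represent the text, with UTF-8 as the final fallback.

    Hypothetical helper; 'accepted' is a list of charset names taken
    from the browser's Accept-Charset header.
    """
    for name in list(accepted) + ["utf-8"]:
        try:
            return name, text.encode(name)
        except (UnicodeEncodeError, LookupError):
            # This charset cannot represent the text, or is unknown.
            continue
    # Only reachable for text that even UTF-8 rejects (lone surrogates).
    raise ValueError("text cannot be encoded")


charset_korean, _ = encode_response("한국어", ["latin-1", "utf-8"])
charset_french, body_french = encode_response("café", ["latin-1", "utf-8"])
```

With these inputs the Korean text ends up as utf-8, because latin-1 cannot represent it, while the French text is sent as latin-1.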
This finally made sense to me as well. I had a form with a widget
rendered by my own HTML generation code, and with a zope.app.widgets
text field. I pasted Sam Ruby's "Internationalization" diacritic-heavy
string into both fields. When I saw that the zope.app.widget was
rendering properly while my own field was not, that sealed it.
Unfortunately, all of my prior tests had involved my own widget, since
that is where I had seen the junk characters.
Now I ensure that my HTML generator is all unicode. Any basic string
that it encounters, which typically comes from source code, is decoded
into unicode immediately. As mentioned above, I was wildly and
inappropriately encoding to strings with some forceful settings so
that I could join elements together.
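
A minimal sketch of that rule, in Python 3 terms (bytes for the old 'str', str for 'unicode'). The helper names and the assumption that source-code literals are UTF-8 are hypothetical, not taken from my actual generator.

```python
from html import escape


def to_text(value, source_encoding="utf-8"):
    """Decode byte strings as soon as they are seen; pass text through.

    source_encoding is an assumption about where the bytes came from.
    """
    if isinstance(value, bytes):
        return value.decode(source_encoding)
    return value


def join_fragments(fragments):
    """Join HTML fragments as text only; nothing is encoded here."""
    return "".join(to_text(fragment) for fragment in fragments)


# Byte-string literals from source code mix safely with user text.
page = join_fragments([b"<p>", escape("café & crème"), b"</p>"])
```

Encoding to bytes then happens exactly once, when the finished response goes out, rather than at every join.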
I'm wondering if I make this clear enough in my book. It's always hard
to tell by myself since these things seem obvious to me. If you've got any
constructive feedback regarding this, I'll be more than happy to hear it
and consequently improve the book for you "Stupid Americans" :).
At quick glance, I didn't see where this might have been described.
There's no mention of unicode in the back index, and from the table of
contents I didn't see much besides the chapter on internationalization
(which we're completely avoiding until we absolutely need to do it).
But this helps. Between all of the answers I've received thus far, I
finally have a grasp of what I'm doing. I'll try to codify it into a
Zope3-users mailing list