Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

Jeff Shell Thu, 01 Mar 2007 12:20:11 -0800

On 3/1/07, Paul Winkler <[EMAIL PROTECTED]> wrote:
> On Wed, Feb 28, 2007 at 09:08:03PM -0500, Gary Poster wrote:
> > It's been years since I dug into this, but I'm better than 90% sure
> > that the browser is expected to make its requests in the encoding of
> > the response (i.e., the one set by Content-Type).  It's been too long
> > for me to tell you if that's in a spec or if it is simply the de
> > facto rule, though I suspect the former.
>
> That almost makes sense, except that the first request precedes the
> first response :) I'll have to dig into this some more when I have
> time...


By first request do you mean first form-submission? You have to do a
request to get the form. When the server sends the form, the HTTP
response containing the form should have a content type.

If the form to be submitted has an accept-charset attribute explicitly
declared, that should become the value of the Accept-Charset header.
If that attribute is absent, it's supposed to be understood as a special
value, 'UNKNOWN', which means that the browser or other user-agent may
submit (I don't remember if the spec says MAY or SHOULD, but I know it
doesn't say MUST) the form data in the same character set as the form's
page.

I did a fair amount of spec reading and zope.publisher.http/browser
entrail reading yesterday, can you tell? :)

Anyways, without adding accept-charset to the form, this is what
Firefox sent in the Accept-Charset header of a form submission request::

   ISO-8859-1,utf-8;q=0.7,*;q=0.7

Zope turned that into::

   ['utf-8', 'iso-8859-1', '*']

Zope gives UTF-8 priority over everything. The Accept-Charset header,
if present on the request, is used to establish the response character
set unless explicitly stated otherwise (or the response isn't text).
So I guess if my Firefox is sending that same accept-charset header to
Zope on each request, it will get a UTF-8 response every time (again,
unless explicitly made otherwise). If it is supposed to submit POSTs
in the same character set that it received, then it should be sending
UTF-8 each time. Hunh.
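
The reordering above can be sketched in a few lines of Python. This is a
minimal illustration, not zope.publisher's actual code: the function name
`preferred_charsets` and the trick of boosting UTF-8's quality value are
assumptions made only to reproduce the list shown earlier.

```python
def preferred_charsets(header):
    """Return charset names from an Accept-Charset header, best first."""
    ranked = []
    for position, item in enumerate(header.split(',')):
        name, _, params = item.strip().partition(';')
        name = name.strip().lower()
        if not name:
            continue
        quality = 1.0  # HTTP default when no q-value is given
        params = params.strip()
        if params.startswith('q='):
            try:
                quality = float(params[2:])
            except ValueError:
                continue  # skip malformed entries
        if quality == 0:
            continue  # q=0 means "not acceptable"
        if name == 'utf-8':
            quality = 2.0  # assumption: force UTF-8 to the front
        # Negative quality sorts best first; position keeps ties stable.
        ranked.append((-quality, position, name))
    ranked.sort()
    return [name for _, _, name in ranked]


charsets = preferred_charsets('ISO-8859-1,utf-8;q=0.7,*;q=0.7')
```

With the Firefox header from above, `charsets` comes out as
`['utf-8', 'iso-8859-1', '*']`.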

So if you had <form ... accept-charset="cp437">, then the browser
should send only cp437 in the Accept-Charset header and Zope should
only try to decode from that character set; and the succeeding
response should be encoded in cp437 as well. I think. That seems to be
the best I can figure out between the HTML 4.01 and HTTP 1.1 specs and
zope.publisher's http/browser request and response handlers. It seems
unlikely that you would ever need to use accept-charset like this,
though; at least not in Zope, which does a good job of handling all of
this encoding/decoding work.
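
Decoding submitted form data against a list of candidate character sets
might look roughly like the sketch below. It is written for Python 3, where
`bytes` and `str` play the roles of Python 2's `str` and `unicode`. The name
`decode_form_value` and the UTF-8 fallback are assumptions for illustration;
this is not the code in zope.publisher.

```python
def decode_form_value(raw, charsets):
    """Decode submitted bytes with the first charset that fits."""
    for charset in charsets:
        if charset == '*':
            continue  # the wildcard names no concrete codec to try
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next one
    # Assumption: fall back to UTF-8, marking undecodable bytes.
    return raw.decode('utf-8', 'replace')


order = ['utf-8', 'iso-8859-1', '*']
from_utf8 = decode_form_value('café'.encode('utf-8'), order)
from_latin1 = decode_form_value('café'.encode('iso-8859-1'), order)
from_cp437 = decode_form_value('é'.encode('cp437'), ['cp437'])
```

Order matters here: iso-8859-1 can decode any byte sequence without raising,
so a stricter codec such as UTF-8 has to be tried before it.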

Well, all of this is good to finally know. This has been a mysterious
black box to me for such a long time, and it turns out that I don't
need to worry about it.

The lessons I've learned for text, as they apply to my own code, are thus:

- Work in unicode, not strings; then you won't have to worry about collisions
 between unicode and non-ASCII strings ('abé' + u'cd') raising decode errors.

- When working with text, decode strings to unicode instead of encoding
 unicode to strings. I was forcibly **encoding** my unicode objects when I'd
 be building up long strings, which came from my confusion over
 encode/decode. This is how I'd lose my extended characters and end up with
 garbage output.

- Be alert to what other text processing tools such as the Python
 implementations of Textile and Markdown want as input and return as output.
 In my ignorance, I wasn't paying attention to the fact that I needed to
 decode the results back to unicode, and I believe this was another systemic
 central point of pain, torture, and failure for my apps. And in my ignorance
 I tried to fix the errors that I saw with forcible *encoding* instead of
 *decoding*, which is why I would see garbage characters show up in
 certain situations. I now realize this is the right way to work with those
 tools::

       rendered = textile(content.encode('utf-8'), encoding='utf-8',
                          output='utf-8')
       return rendered.decode('utf-8')
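
The same boundary pattern, sketched for Python 3 (where `str` is what
Python 2 called `unicode`, and `bytes` is what it called `str`). The function
`render_bytes` is a hypothetical stand-in for a byte-oriented markup tool,
not the real Textile or Markdown API; `render_text` is likewise an
illustrative name.

```python
def render_bytes(source, encoding='utf-8'):
    """Hypothetical stand-in for a tool that takes and returns bytes."""
    text = source.decode(encoding)
    return ('<p>%s</p>' % text).encode(encoding)


def render_text(content):
    """Accept text, return text; bytes exist only around the tool call."""
    rendered = render_bytes(content.encode('utf-8'), encoding='utf-8')
    return rendered.decode('utf-8')


html = render_text('café')

# Decoding with the wrong codec is what produces garbage characters:
# the two UTF-8 bytes for 'é' become two separate Latin-1 characters.
garbled = 'café'.encode('utf-8').decode('iso-8859-1')
```

Here `html` stays clean text, while `garbled` shows the kind of output
that a mismatched encode/decode pair leaves behind.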

Does that all sound right?

--
Jeff Shell
_______________________________________________
Zope3-users mailing list
[email protected]
http://mail.zope.org/mailman/listinfo/zope3-users
