Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-03-01 Thread Paul Winkler
On Wed, Feb 28, 2007 at 09:08:03PM -0500, Gary Poster wrote:
> It's been years since I dug into this, but I'm better than 90% sure
> that the browser is expected to make its requests in the encoding of
> the response (i.e., the one set by Content-Type).  It's been too long
> for me to tell you if that's in a spec or if it is simply the de
> facto rule, though I suspect the former.

That almost makes sense, except that the first request precedes the
first response :) I'll have to dig into this some more when I have
time...

-- 

Paul Winkler
http://www.slinkp.com
___
Zope3-users mailing list
Zope3-users@zope.org
http://mail.zope.org/mailman/listinfo/zope3-users


Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-03-01 Thread Jeff Shell

On 3/1/07, Paul Winkler [EMAIL PROTECTED] wrote:

> On Wed, Feb 28, 2007 at 09:08:03PM -0500, Gary Poster wrote:
> > It's been years since I dug into this, but I'm better than 90% sure
> > that the browser is expected to make its requests in the encoding of
> > the response (i.e., the one set by Content-Type).  It's been too long
> > for me to tell you if that's in a spec or if it is simply the de
> > facto rule, though I suspect the former.
>
> That almost makes sense, except that the first request precedes the
> first response :) I'll have to dig into this some more when I have
> time...


By first request do you mean first form-submission? You have to do a
request to get the form. When the server sends the form, the HTTP
response containing the form should have a content type.

If the form to be submitted has an accept-charset attribute explicitly
declared, that should become the value of the Accept-Charset header.
If that attribute is absent, it's supposed to be understood as a
special value, 'UNKNOWN', which means that the browser or other
user-agent may submit the form data in the same character set as the
form's page. (I don't remember if the spec says MAY or SHOULD, but I
know it doesn't say MUST.)

I did a fair amount of spec reading and zope.publisher.http/browser
entrail reading yesterday, can you tell? :)

Anyways, without adding accept-charset to the form, this is what
Firefox sent in the Accept-Charset header of a form submission request::

   ISO-8859-1,utf-8;q=0.7,*;q=0.7

Zope turned that into::

   ['utf-8', 'iso-8859-1', '*']

Zope gives UTF-8 priority over everything. The Accept-Charset header,
if present on the request, is used to establish the response character
set unless explicitly stated otherwise (or the response isn't text).
So I guess if my Firefox is sending that same accept-charset header to
Zope on each request, it will get a UTF-8 response every time (again,
unless explicitly made otherwise). If it is supposed to submit POSTs
in the same character set that it received, then it should be sending
UTF-8 each time. Hunh.
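
For illustration, a header like that can be reduced to an ordered list
with a few lines of Python. This is only a sketch of the idea, not
zope.publisher's code: `preferred_charsets` is a hypothetical name, and
bumping utf-8 to the front is an assumption made to match the list
shown above.

```python
def preferred_charsets(header):
    """Return charset names from an Accept-Charset value, best first."""
    entries = []
    for position, item in enumerate(header.split(',')):
        name, _, params = item.strip().partition(';')
        name = name.strip().lower()
        if not name:
            continue
        quality = 1.0  # no q parameter means full preference
        params = params.strip()
        if params.startswith('q='):
            try:
                quality = float(params[2:])
            except ValueError:
                continue  # skip entries with a malformed q value
        if quality == 0.0:
            continue  # q=0 means "not acceptable"
        if name == 'utf-8':
            quality = 2.0  # assumption: utf-8 wins unless excluded
        # Sort by descending quality, then by original position.
        entries.append((-quality, position, name))
    entries.sort()
    return [name for _, _, name in entries]


charsets = preferred_charsets('ISO-8859-1,utf-8;q=0.7,*;q=0.7')
```

With the Firefox header above this gives utf-8 first, then iso-8859-1,
then the wildcard.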

So if you had <form ... accept-charset="cp437">, then the browser
should send only cp437 in the Accept-Charset header and Zope should
only try to decode from that character set; and the succeeding
response should be encoded in cp437 as well. I think. That seems to be
the best I can figure out between the HTML 4.01 and HTTP 1.1 specs and
zope.publisher's http/browser request and response handlers. It seems
unlikely that you would ever need to use accept-charset like this,
though; at least not in Zope, which does a good job of doing all of
this encoding/decoding work.

Well, all of this is good to finally know. This has been a mysterious
black box to me for such a long time, and it turns out that I don't
need to worry about it.

The lessons I've learned for text, as they apply to my own code, are thus:

- Work in unicode, not strings; then you won't have to worry about collisions
  between unicode and strings (such as 'cdé' + u'ab') raising decode errors.

- When working with text, decode strings to unicode instead of encoding
  unicode to strings. I was forcibly **encoding** my unicode objects when I'd
  be building up long strings, which came from my confusion over
  encode/decode. This is how I'd lose my extended characters and end up with
  garbage output.

- Be alert to what other text processing tools such as the Python
  implementations of Textile and Markdown want as input and return as output.
  In my ignorance, I wasn't paying attention to the fact that I needed to
  decode the results back to unicode, and I believe this was another systemic
  central point of pain, torture, and failure for my apps. And in my ignorance
  I tried to fix the errors that I saw with forcible *encoding* instead of
  *decoding*, which is why I would see garbage characters show up in
  certain situations. I now realize this is the right way to work with those
  tools::

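   # textile works on byte strings: pass UTF-8 in and ask for UTF-8 out
   rendered = textile(content.encode('utf-8'), encoding='utf-8',
                      output='utf-8')
   # decode the result so the rest of the code keeps working in unicode
   return rendered.decode('utf-8')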
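
Those three points boil down to: decode at the edges, work in unicode
in the middle, and encode once on the way out. Here is a minimal sketch
in Python 3 terms, where `bytes` plays the role of the old `str` and
`str` the role of `unicode`. The helpers `to_text` and `render` are
made-up names for illustration, not part of Zope or Textile.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def render(parts):
    """Join a mix of byte and text fragments into one text string."""
    return ''.join(to_text(part) for part in parts)


# Fragments from source code (bytes) and from user input (text).
html = render([b'<p>', 'caf\u00e9', b'</p>'])

# Encode exactly once, when the output leaves the application.
body = html.encode('utf-8')

# Forcing an early encode with a lossy error handler is how extended
# characters get replaced by junk.
lossy = 'caf\u00e9'.encode('ascii', 'replace')
```

The assumption here is that every byte string entering `to_text` is
UTF-8; anything else needs its real encoding passed in.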

Does that all sound right?

--
Jeff Shell


Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-02-28 Thread Jeff Shell

On 2/28/07, Philipp von Weitershausen [EMAIL PROTECTED] wrote:

> Jeff Shell wrote:
> > - Not have any encode / decode errors. 'ascii codec doesn't recognize
> > character ... at position ...'. I don't want to keep on bullying
> > through whenever this pops up.
>
> You can't just simply do str(some_unicode) or unicode(some_str), unless
> you really know that you're only dealing with the ASCII subset in both
> cases. Use explicit encodings to convert.
>
> Now, the trick is obviously to know the encoding. A 'str' object is
> worth squat if you don't know the encoding that goes along with it. In
> other words, (some_str, encoding) is isomorphic to a unicode object.


Ahh. I finally get this now. I was casting back and forth with wild
abandon in some key places - in one particular place I was doing wild
encoding somersaults when I really meant to be doing a small set of
decode tries. I think this is why I was seeing customer garbage: I was
turning unicode into strs and back again long before the final
response was all built up.
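
To make the (some_str, encoding) point concrete, here is a minimal
sketch. It is written for Python 3, where `bytes` plays the role of the
old `str` and `str` the role of `unicode`; the sample bytes are just an
illustration.

```python
# The same bytes mean different text depending on the assumed encoding.
raw = b'caf\xc3\xa9'  # UTF-8 bytes for "cafe" with an acute final e

as_utf8 = raw.decode('utf-8')      # right encoding: 4 characters
as_latin1 = raw.decode('latin-1')  # wrong encoding: no error, 5 characters

# Round trip: decoding and re-encoding with the same codec restores the
# original bytes, which is why (bytes, encoding) carries the same
# information as the decoded text.
round_trip = as_utf8.encode('utf-8')
```

The wrong guess does not raise anything; it silently produces two junk
characters where the accented letter should be.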


> > - HOW do I know what a browser has sent me? There doesn't seem to be
> > a real way of handling this. Do I guess?
>
> That's sorta what zope.publisher does. Actually, it figures that if the
> browser sends an Accept-Charset header, the stuff that it's sending to us
> would be encoded in one of those encodings, so it tries the ones in
> Accept-Charset until it's lucky. It falls back to UTF-8.
>
> This seems to work. But yeah, it's relying on implementation details of
> the browser and it's weird.
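
Roughly, that try-each-charset approach could look like the following.
This is a sketch of the idea only, not the zope.publisher
implementation; `decode_form_value` is a hypothetical name, and the
code is Python 3, where `bytes` stands in for the old `str`.

```python
def decode_form_value(raw, charsets):
    """Decode raw request bytes using the first charset that works."""
    for charset in charsets:
        if charset == '*':
            continue  # the wildcard names no concrete codec
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next one.
            continue
    # Assumed fallback; this can still raise if the bytes are not UTF-8.
    return raw.decode('utf-8')


charsets = ['utf-8', 'iso-8859-1', '*']
from_utf8 = decode_form_value(b'caf\xc3\xa9', charsets)
from_latin1 = decode_form_value(b'caf\xe9', charsets)
```

Order matters: iso-8859-1 accepts any byte sequence, so a stricter
codec such as utf-8 has to be tried first or it never gets a chance.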


Ugh. I don't know how I missed that header. I was always looking for a
content-type on the post, hoping that it had the information.

I was finally able to confirm that Zope was handing me the data
properly; it was some of my HTML generation code that was mangling
data on output.


> > But again,
> > how do I know when to decode from latin-1 and when to decode from
> > UTF-8? When or why should I encode to one or the other at response
> > time? Should I worry at all?
>
> If you're using Zope, you don't have to encode outgoing text at all,
> unless you're setting a non-text content-type on the outgoing response.
> If the content-type is text/*, you can just return unicode from your
> browser view and zope.publisher will use the best encoding that the
> browser prefers (from Accept-Charset). "Best" meaning that if the
> browser accepts latin-1,utf-8 and your page contains Korean text, it'll
> use utf-8, not latin-1. utf-8 is always a fallback, anyway, so that
> there's no chance to not be able to encode.
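
The "best encoding" choice can be pictured like this. Again a sketch
under assumptions, not zope.publisher's actual code:
`encode_response` is a hypothetical helper, and the code is Python 3,
where `str` corresponds to the old `unicode`.

```python
def encode_response(text, charsets):
    """Encode text with the first preferred charset able to represent it."""
    for charset in charsets:
        if charset == '*':
            continue  # the wildcard names no concrete codec
        try:
            return charset, text.encode(charset)
        except (UnicodeEncodeError, LookupError):
            # This charset cannot represent the text: try the next one.
            continue
    # utf-8 can encode any ordinary text, so it is a safe last resort.
    return 'utf-8', text.encode('utf-8')


browser_prefers = ['iso-8859-1', 'utf-8']
latin_choice, _ = encode_response('caf\u00e9', browser_prefers)
korean_choice, _ = encode_response('\ud55c\uad6d\uc5b4', browser_prefers)
```

Here the Latin text goes out as iso-8859-1 because the browser listed it
first, while the Korean text skips to utf-8 since latin-1 cannot
represent it.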


This finally made sense to me as well. I had a form with a widget
rendered by my own HTML generation code, and with a zope.app.widgets
text field. I pasted Sam Ruby's Internationalization diacritic-heavy
string into both fields. When I saw that the zope.app.widget was
rendering properly while my own field was not, that sealed it.
Unfortunately, all of my prior tests had involved my own widget, since
that is where I had seen the junk characters.

Now I ensure that my HTML generator is all unicode. Any basic string
that it encounters, which typically comes from source code, is decoded
into unicode immediately. As mentioned above, I was wildly and
inappropriately encoding to strings with some forceful settings so
that I could join elements together.


> I'm wondering if I make this clear enough in my book. It's always hard
> to tell by myself since these things seem obvious to me. If you have any
> constructive feedback regarding this, I'll be more than happy to hear it
> and consequently improve the book for you Stupid Americans :).


At quick glance, I didn't see where this might have been described.
There's no mention of unicode in the back index, and from the table of
contents I didn't see much besides the chapter on internationalization
(which we're completely avoiding until we absolutely need to do it).

But this helps. Between all of the answers I've received thus far, I
finally have a grasp of what I'm doing. I'll try to codify it into a
useful document.

Thanks!

--
Jeff Shell


Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-02-28 Thread Paul Winkler
On Wed, Feb 28, 2007 at 02:06:39PM -0700, Jeff Shell wrote:
> On 2/28/07, Philipp von Weitershausen [EMAIL PROTECTED] wrote:
> > That's sorta what zope.publisher does. Actually, it figures that if the
> > browser sends an Accept-Charset header, the stuff that it's sending to us
> > would be encoded in one of those encodings, so it tries the ones in
> > Accept-Charset until it's lucky. It falls back to UTF-8.
> >
> > This seems to work. But yeah, it's relying on implementation details of
> > the browser and it's weird.
>
> Ugh. I don't know how I missed that header. I was always looking for a
> content-type on the post, hoping that it had the information.

I'm rather late to this particular party, and I'm far from an expert
on either unicode or HTTP, but I have to ask: Is it just me, or is
HTTP's support for specifying encodings completely inadequate?

As far as I can tell, there are only two relevant headers.  The
request may specify Accept-Charset, whose meaning is given as what
character sets are acceptable for the *response* (emphasis mine).
The response may specify Content-Type, which again is irrelevant to
the request.  If there's anything that allows the client to specify
the encoding in use *for the request data*, I don't see it.

That seems like quite an oversight to make as late as HTTP 1.1 (1999).
What am I missing?


-PW


-- 

Paul Winkler
http://www.slinkp.com


Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-02-28 Thread Gary Poster


On Feb 28, 2007, at 5:37 PM, Paul Winkler wrote:


> On Wed, Feb 28, 2007 at 02:06:39PM -0700, Jeff Shell wrote:
> > On 2/28/07, Philipp von Weitershausen [EMAIL PROTECTED] wrote:
> > > That's sorta what zope.publisher does. Actually, it figures that if the
> > > browser sends an Accept-Charset header, the stuff that it's sending to us
> > > would be encoded in one of those encodings, so it tries the ones in
> > > Accept-Charset until it's lucky. It falls back to UTF-8.
> > >
> > > This seems to work. But yeah, it's relying on implementation details of
> > > the browser and it's weird.
> >
> > Ugh. I don't know how I missed that header. I was always looking for a
> > content-type on the post, hoping that it had the information.
>
> I'm rather late to this particular party, and I'm far from an expert
> on either unicode or HTTP, but I have to ask: Is it just me, or is
> HTTP's support for specifying encodings completely inadequate?
>
> As far as I can tell, there are only two relevant headers.  The
> request may specify Accept-Charset, whose meaning is given as what
> character sets are acceptable for the *response* (emphasis mine).
> The response may specify Content-Type, which again is irrelevant to
> the request.  If there's anything that allows the client to specify
> the encoding in use *for the request data*, I don't see it.
>
> That seems like quite an oversight to make as late as HTTP 1.1 (1999).
> What am I missing?


It's been years since I dug into this, but I'm better than 90% sure  
that the browser is expected to make its requests in the encoding of  
the response (i.e., the one set by Content-Type).  It's been too long  
for me to tell you if that's in a spec or if it is simply the de  
facto rule, though I suspect the former.


Gary

