Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-03-01 Thread Jeff Shell

On 3/1/07, Paul Winkler <[EMAIL PROTECTED]> wrote:

> On Wed, Feb 28, 2007 at 09:08:03PM -0500, Gary Poster wrote:
> > It's been years since I dug into this, but I'm better than 90% sure
> > that the browser is expected to make its requests in the encoding of
> > the response (i.e., the one set by Content-Type).  It's been too long
> > for me to tell you if that's in a spec or if it is simply the de
> > facto rule, though I suspect the former.
>
> That almost makes sense, except that the first request precedes the
> first response :) I'll have to dig into this some more when I have
> time...


By first request do you mean first form-submission? You have to do a
request to get the form. When the server sends the form, the HTTP
response containing the form should have a content type.

If the form to be submitted has an accept-charset attribute explicitly
declared, that should become the value of the Accept-Charset header.
If that attribute is absent, it's supposed to be understood as a special
value, 'UNKNOWN', which means that the browser or other user-agent may
submit (I don't remember if the spec says MAY or SHOULD, but I know it
doesn't say MUST) the form data in the same character set as the form's
page.

I did a fair amount of spec reading and zope.publisher.http/browser
entrail reading yesterday, can you tell? :)

Anyways, without adding accept_charset to the form, this is what
Firefox sent on a form submission request's Accept-Charset header::

   ISO-8859-1,utf-8;q=0.7,*;q=0.7

Zope turned that into::

   ['utf-8', 'iso-8859-1', '*']

Zope gives UTF-8 priority over everything. The Accept-Charset header,
if present on the request, is used to establish the response character
set unless explicitly stated otherwise (or the response isn't text).
So I guess if my Firefox is sending that same accept-charset header to
Zope on each request, it will get a UTF-8 response every time (again,
unless explicitly made otherwise). If it is supposed to submit POSTs
in the same character set that it received, then it should be sending
UTF-8 each time. Hunh.
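
The reordering shown above can be sketched in a few lines. This is only an illustration written for Python 3, not zope.publisher's actual code: the function name `preferred_charsets`, the handling of malformed q-values, and the tie-breaking rule are all assumptions. Only the input header and the resulting order come from the message.

```python
def preferred_charsets(header):
    """Return charset names from an Accept-Charset value, best first.

    Assumed policy (hypothetical, modelled on the behaviour described in
    the message): utf-8 always wins, then higher q-values, then the order
    in which the names appeared in the header.
    """
    entries = []
    for position, item in enumerate(header.split(",")):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        quality = 1.0  # a charset without a q parameter counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # assumption: treat a malformed q as "not acceptable"
        if quality > 0:
            entries.append((quality, position, name))
    entries.sort(key=lambda e: (e[2] != "utf-8", -e[0], e[1]))
    return [name for _, _, name in entries]


print(preferred_charsets("ISO-8859-1,utf-8;q=0.7,*;q=0.7"))
# ['utf-8', 'iso-8859-1', '*']
```

Sorting on a tuple keeps the policy in one place, so changing the priority rule means changing one line.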

So if you had a form declaring cp437 as its accept-charset, then the browser
should send only cp437 in the Accept-Charset header and Zope should
only try to decode from that character set; and the succeeding
response should be encoded in cp437 as well. I think. That seems to be
the best I can figure out between the HTML 4.01 and HTTP 1.1 specs and
zope.publisher's http/browser request and response handlers. It seems
unlikely that you would ever need to use accept_charset like this,
though; at least not in Zope which does a good job of doing all of
this encoding/decoding work.

Well, all of this is good to finally know. This has been a mysterious
black box to me for such a long time, and it turns out that I don't
need to worry about it.

The lessons I've learned for text, as they apply to my own code, are thus:

- Work in unicode, not strings; then you won't have to worry about collisions
 between unicode and strings ('abé' + u'cdé') raising decode errors.

- When working with text, decode strings to unicode instead of encoding
 unicode to strings. I was forcibly **encoding** my unicode objects when I'd
 be building up long strings, which came from my confusion over
 encode/decode. This is how I'd lose my extended characters and end up with
 garbage output.

- Be alert to what other text processing tools such as the Python
 implementations of Textile and Markdown want as input and return as output.
 In my ignorance, I wasn't paying attention to the fact that I needed to
 decode the results back to unicode, and I believe this was another systemic
 central point of pain, torture, and failure for my apps. And in my ignorance
 I tried to fix the errors that I saw with forcible *encoding* instead of
 *decoding*, which is why I would see garbage characters show up in
 certain situations. I now realize this is the right way to work with those
 tools::

   rendered = textile(content.encode('utf-8'), encoding='utf-8',
                      output='utf-8')
   return rendered.decode('utf-8')
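
The same encode-on-the-way-in, decode-on-the-way-out pattern can be wrapped once and reused for any byte-oriented tool. The sketch below is for Python 3, where `str` and `bytes` play the roles of the thread's `unicode` and `str`. The names `render_via_bytes` and `fake_renderer` are hypothetical, and the stand-in renderer is not the real Textile API.

```python
def render_via_bytes(tool, text, encoding="utf-8"):
    """Call a byte-oriented tool with text and hand text back.

    `tool` is assumed to take bytes and return bytes in the same encoding.
    """
    raw = tool(text.encode(encoding))  # encode only at the boundary
    return raw.decode(encoding)        # decode the result straight away


def fake_renderer(data):
    # Hypothetical stand-in for a byte-oriented renderer such as Textile.
    return b"<p>" + data + b"</p>"


rendered = render_via_bytes(fake_renderer, "cdé")
print(rendered)  # <p>cdé</p>
```

Everything outside the wrapper then only ever sees text, which is the point of the third lesson above.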

Does that all sound right?

--
Jeff Shell
___
Zope3-users mailing list
Zope3-users@zope.org
http://mail.zope.org/mailman/listinfo/zope3-users


Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-03-01 Thread Paul Winkler
On Wed, Feb 28, 2007 at 09:08:03PM -0500, Gary Poster wrote:
> It's been years since I dug into this, but I'm better than 90% sure  
> that the browser is expected to make its requests in the encoding of  
> the response (i.e., the one set by Content-Type).  It's been too long  
> for me to tell you if that's in a spec or if it is simply the de  
> facto rule, though I suspect the former.

That almost makes sense, except that the first request precedes the
first response :) I'll have to dig into this some more when I have
time...

-- 

Paul Winkler
http://www.slinkp.com


[Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-03-01 Thread Raphael Ritz

Jeff Shell wrote:
[..]


> I feel like I know enough to squeak by, but that's no longer
> acceptable. Sometimes I quiver in terror, waiting for everything to
> fall down because of something so seemingly basic like strings/text.
> It may be a lot of technical debt, or it may be extremely easy to pay
> down. In any case, it's time to pay it down. :)



In addition to what has already been said (and maybe not really
what you are asking for), the number one problem people encounter
can be illustrated within a simple interactive Python session
like this one:

[EMAIL PROTECTED] ~]$ python244
Python 2.4.4 (#1, Jan 29 2007, 13:00:46)
[GCC 4.1.1 20060525 (Red Hat 4.1.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s1 = u"I'm unicode"
>>> s2 = "I'm a byte string"
>>> s1 + s2
u"I'm unicodeI'm a byte string"
>>> s1 = u"I'm unicode äöü"
>>> s1
u"I'm unicode \xe4\xf6\xfc"
>>> s2 = "I'm a byte string äöü"
>>> s2
"I'm a byte string \xc3\xa4\xc3\xb6\xc3\xbc"
>>> s1 + s2
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)

>>> s1 + s2.decode('utf-8')
u"I'm unicode \xe4\xf6\xfcI'm a byte string \xe4\xf6\xfc"
>>>

The cause is Python's implicit conversion of byte strings to unicode
strings the moment it has to combine the two, together with Python's
default encoding being 'ascii' (don't ask why; it was a long discussion
at the time ...).
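
For comparison, here is a rough Python 3 rendering of the session above. It assumes Python 3, where the implicit conversion no longer exists: mixing `str` and `bytes` fails at once with a TypeError, and the ASCII failure only appears if you ask for that codec explicitly. The variable names are illustrative only.

```python
text = "I'm unicode äöü"                        # Python 3 str (Python 2 unicode)
data = "I'm a byte string äöü".encode("utf-8")  # Python 3 bytes (Python 2 str)

# Mixing the two types is refused outright instead of being guessed at.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding with the 'ascii' codec reproduces the failure position
# reported in the Python 2 traceback.
try:
    data.decode("ascii")
    ascii_error_at = None
except UnicodeDecodeError as exc:
    ascii_error_at = exc.start

# Explicit decoding with the right codec works, as in the session.
combined = text + data.decode("utf-8")

print(mixed_ok, ascii_error_at)  # False 18
print(combined)
```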

The moment you get the strings involved from some application
call or third-party code you often don't really know what you
will get, so unless you do extensive checks and explicit
decoding using the right codec you'll never be on the safe side.
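
One way to do those checks is a single helper applied at every boundary. This is a minimal Python 3 sketch. The name `to_text` and the UTF-8 default are assumptions, and the caller still has to know which codec the other side really used.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding bytes with an explicit codec.

    Decoding is strict on purpose: wrong bytes raise UnicodeDecodeError
    here, at the boundary, rather than turning into garbage later.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    raise TypeError("expected text or bytes, got %s" % type(value).__name__)


print(to_text(b"abc"))                                # abc
print(to_text("äöü".encode("latin-1"), "latin-1"))    # äöü
```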

Within Zope 3, however, this should be a non-issue as Zope 3
(including Zope 3-based applications) should only use unicode
strings.

Just my 2 cents.

Raphael







Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-02-28 Thread Gary Poster


On Feb 28, 2007, at 5:37 PM, Paul Winkler wrote:


> On Wed, Feb 28, 2007 at 02:06:39PM -0700, Jeff Shell wrote:
> > On 2/28/07, Philipp von Weitershausen <[EMAIL PROTECTED]> wrote:
> > > That's sorta what zope.publisher does. Actually, it figures that if the
> > > browser sends an Accept-Charset header, the stuff that it's sending to us
> > > would be encoded in one of those encodings, so it tries the ones in
> > > Accept-Charset until it's lucky. It falls back to UTF-8.
> > >
> > > This seems to work. But yeah, it's relying on implementation details of
> > > the browser and it's weird.
> >
> > Ugh. I don't know how I missed that header. I was always looking for a
> > content-type on the post, hoping that it had the information.
>
> I'm rather late to this particular party, and I'm far from an expert
> on either unicode or HTTP, but I have to ask: Is it just me, or is
> HTTP's support for specifying encodings completely inadequate?
>
> As far as I can tell, there are only two relevant headers.  The
> request may specify Accept-Charset, whose meaning is given as "what
> character sets are acceptable for the *response*" (emphasis mine).
> The response may specify Content-Type, which again is irrelevant to
> the request.  If there's anything that allows the client to specify
> the encoding in use *for the request data*, I don't see it.
>
> That seems like quite an oversight to make as late as HTTP 1.1 (1999).
> What am I missing?


It's been years since I dug into this, but I'm better than 90% sure  
that the browser is expected to make its requests in the encoding of  
the response (i.e., the one set by Content-Type).  It's been too long  
for me to tell you if that's in a spec or if it is simply the de  
facto rule, though I suspect the former.


Gary




Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-02-28 Thread Paul Winkler
On Wed, Feb 28, 2007 at 02:06:39PM -0700, Jeff Shell wrote:
> On 2/28/07, Philipp von Weitershausen <[EMAIL PROTECTED]> wrote:
> >That's sorta what zope.publisher does. Actually, it figures that if the
> >browser sends an Accept-Charset header, the stuff that it's sending to us
> >would be encoded in one of those encodings, so it tries the ones in
> >Accept-Charset until it's lucky. It falls back to UTF-8.
> >
> >This seems to work. But yeah, it's relying on implementation details of
> >the browser and it's weird.
> 
> Ugh. I don't know how I missed that header. I was always looking for a
> content-type on the post, hoping that it had the information.

I'm rather late to this particular party, and I'm far from an expert
on either unicode or HTTP, but I have to ask: Is it just me, or is
HTTP's support for specifying encodings completely inadequate?

As far as I can tell, there are only two relevant headers.  The
request may specify Accept-Charset, whose meaning is given as "what
character sets are acceptable for the *response*" (emphasis mine).
The response may specify Content-Type, which again is irrelevant to
the request.  If there's anything that allows the client to specify
the encoding in use *for the request data*, I don't see it.

That seems like quite an oversight to make as late as HTTP 1.1 (1999).
What am I missing?


-PW


-- 

Paul Winkler
http://www.slinkp.com


Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-02-28 Thread Jeff Shell

On 2/28/07, Philipp von Weitershausen <[EMAIL PROTECTED]> wrote:

> Jeff Shell wrote:
> > - Not have any encode / decode errors. 'ascii codec doesn't recognize
> > character ... at position ...'. I don't want to keep on bullying
> > through whenever this pops up.
>
> You can't just simply do str(some_unicode) or unicode(some_str), unless
> you really know that you're only dealing with the ASCII subset in both
> cases. Use explicit encodings to convert.
>
> Now, the trick is obviously to know the encoding. A 'str' object is
> worth squat if you don't know the encoding that goes along with it. In
> other words, (some_str, encoding) is isomorphic to a unicode object.


Ahh. I finally get this now. I was casting back and forth with wild
abandon in some key places - in one particular place I was doing wild
encoding somersaults when I really meant to be doing a small set of
decode tries. I think this is why I was seeing customer garbage: I was
turning unicode into strs and back again long before the final
response was all built up.


> >  - HOW do I know what a browser has sent me? There doesn't seem to be
> > a real way of handling this. Do I guess?
>
> That's sorta what zope.publisher does. Actually, it figures that if the
> browser sends an Accept-Charset header, the stuff that it's sending to us
> would be encoded in one of those encodings, so it tries the ones in
> Accept-Charset until it's lucky. It falls back to UTF-8.
>
> This seems to work. But yeah, it's relying on implementation details of
> the browser and it's weird.


Ugh. I don't know how I missed that header. I was always looking for a
content-type on the post, hoping that it had the information.

I was finally able to confirm that Zope was handing me the data
properly; it was some of my HTML generation code that was mangling
data on output.


> > But again,
> > how do I know when to decode from latin-1 and when to decode from
> > UTF-8? When or why should I encode to one or the other at response
> > time? Should I worry at all?
>
> If you're using Zope, you don't have to encode outgoing text at all,
> unless you're setting a non-text content-type on the outgoing response.
> If the content-type is text/*, you can just return unicode from your
> browser view and zope.publisher will use the best encoding that the
> browser prefers (from Accept-Charset). "Best" meaning that if the
> browser accepts latin-1,utf-8 and your page contains Korean text, it'll
> use utf-8, not latin-1. utf-8 is always a fallback, anyway, so that
> there's no chance to not be able to encode.


This finally made sense to me as well. I had a form with a widget
rendered by my own HTML generation code, and with a zope.app.widgets
text field. I pasted Sam Ruby's "Internationalization" diacritic-heavy
string into both fields. When I saw that the zope.app.widget was
rendering properly while my own field was not, that sealed it.
Unfortunately, all of my prior tests had involved my own widget, since
that is where I had seen the junk characters.

Now I ensure that my HTML generator is all unicode. Any basic string
that it encounters, which typically come from source code, is decoded
into unicode immediately. As mentioned above, I was wildly and
inappropriately encoding to strings with some forceful settings so
that I could join elements together.
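
A tiny illustration of that rule, assuming Python 3 and an invented helper named `element` (it is not the HTML generator discussed here): byte fragments are decoded the moment they arrive, so the join only ever sees text.

```python
def element(tag, *children, encoding="utf-8"):
    """Build '<tag>...</tag>' from text or byte fragments.

    Byte fragments are decoded on arrival. Nothing is encoded here;
    producing bytes is left to whatever writes the final response.
    """
    parts = []
    for child in children:
        if isinstance(child, bytes):
            child = child.decode(encoding)
        parts.append(child)
    return "<%s>%s</%s>" % (tag, "".join(parts), tag)


print(element("p", b"d\xc3\xa9j\xc3\xa0 ", "vu"))  # <p>déjà vu</p>
```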


> I'm wondering if I make this clear enough in my book. It's always hard
> to tell by myself since these things seem obvious to me. If you got any
> constructive feedback regarding this, I'll be more than happy to hear it
> and consequently improve the book for you "Stupid Americans" :).


At quick glance, I didn't see where this might have been described.
There's no mention of unicode in the back index, and from the table of
contents I didn't see much besides the chapter on internationalization
(which we're completely avoiding until we absolutely need to do it).

But this helps. Between all of the answers I've received thus far, I
finally have a grasp of what I'm doing. I'll try to codify it into a
useful document.

Thanks!

--
Jeff Shell


[Zope3-Users] Re: Unicode for Stupid Americans (like me)?

2007-02-28 Thread Philipp von Weitershausen

Jeff Shell wrote:

> I continue to feel like an idiot in the face of Unicode. I finally
> understand what a unicode 'string' really is, and what encode and
> decode mean (they were previously interchangeable in my mind). But I
> don't know the best practices.
>
> My desire is to:
>
> - Not have any encode / decode errors. 'ascii codec doesn't recognize
> character ... at position ...'. I don't want to keep on bullying
> through whenever this pops up.


You can't just simply do str(some_unicode) or unicode(some_str), unless 
you really know that you're only dealing with the ASCII subset in both 
cases. Use explicit encodings to convert.


Now, the trick is obviously to know the encoding. A 'str' object is 
worth squat if you don't know the encoding that goes along with it. In 
other words, (some_str, encoding) is isomorphic to a unicode object.
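
To make the (some_str, encoding) pairing concrete, here is a short Python 3 sketch in which `bytes` stands in for Python 2's `str` and `str` for `unicode`. The sample text and variable names are made up for illustration.

```python
text = "cdé"

# The same text becomes different bytes under different encodings ...
utf8_bytes = text.encode("utf-8")      # b'cd\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'cd\xe9'

# ... so bytes only turn back into the right text when paired with the
# encoding that produced them.
roundtrip = utf8_bytes.decode("utf-8")
misread = utf8_bytes.decode("latin-1")  # wrong codec: no error, just garbage

print(roundtrip)  # cdé
print(misread)    # cdÃ©
```

The misread case raises nothing, which is why this kind of mistake tends to show up as junk characters in a page rather than as a traceback.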



> - Not turn customer input into garbage. It may render to the public
> site fine, but sometimes in the admin skin's text areas, things turn
> funky. I don't know if there's something I need to do at form-handling
> time, or at rendering time, or what... I did a test based on a
> document by Sam Ruby, and guess that I'm often getting Latin-1 from
> our customers, which doesn't map to UTF-8 (the diacritic marks go
> haywire).
>
>  - HOW do I know what a browser has sent me? There doesn't seem to be
> a real way of handling this. Do I guess?


That's sorta what zope.publisher does. Actually, it figures that if the 
browser sends an Accept-Charset header, the stuff that it's sending to us
would be encoded in one of those encodings, so it tries the ones in 
Accept-Charset until it's lucky. It falls back to UTF-8.


This seems to work. But yeah, it's relying on implementation details of 
the browser and it's weird.
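
The "try until lucky" idea might look roughly like this. It is a hedged Python 3 sketch, not the zope.publisher source: the function name `decode_request_bytes` and the decision to skip the `*` wildcard are assumptions.

```python
def decode_request_bytes(data, charsets, fallback="utf-8"):
    """Decode request bytes with the first candidate charset that works."""
    for charset in charsets:
        if charset == "*":
            continue  # a wildcard names no concrete codec to try
        try:
            return data.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    return data.decode(fallback)


candidates = ["utf-8", "iso-8859-1", "*"]
print(decode_request_bytes("é".encode("utf-8"), candidates))    # é
print(decode_request_bytes("é".encode("latin-1"), candidates))  # é
```

The order of the candidates matters. iso-8859-1 accepts every possible byte sequence, so it can never fail and any charset listed after it is never reached. Trying utf-8 first works because bytes that are not valid UTF-8 are rejected.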



> - Know without a doubt when to encode, and when to decode. I guess the
> "proper" thing to do is to store everything as unicode, and to decode
> to unicode as early as possible when input is coming in.


Absolutely correct.


> But again,
> how do I know when to decode from latin-1 and when to decode from
> UTF-8? When or why should I encode to one or the other at response
> time? Should I worry at all?


If you're using Zope, you don't have to encode outgoing text at all, 
unless you're setting a non-text content-type on the outgoing response. 
If the content-type is text/*, you can just return unicode from your 
browser view and zope.publisher will use the best encoding that the 
browser prefers (from Accept-Charset). "Best" meaning that if the 
browser accepts latin-1,utf-8 and your page contains Korean text, it'll 
use utf-8, not latin-1. utf-8 is always a fallback, anyway, so that 
there's no chance to not be able to encode.


You can, of course, encode yourself in the browser view. You can pick 
pretty much any encoding you like, all you have to do is tell the 
browser about it in the response header (Content-Type: 
foo/bar;charset=your-encoding).
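
Both options can be sketched together: pick the first acceptable charset that can represent the text, fall back to UTF-8, and announce the choice in Content-Type. This is an illustration for Python 3 only. `encode_response` is a hypothetical name, text/html is an arbitrary example type, and none of it is the zope.publisher implementation.

```python
def encode_response(text, charsets, fallback="utf-8"):
    """Return (body_bytes, charset) using the first charset that can encode text."""
    for charset in charsets:
        if charset == "*":
            continue  # a wildcard names no concrete codec to try
        try:
            return text.encode(charset), charset
        except (UnicodeEncodeError, LookupError):
            continue  # cannot represent this text, or unknown codec name
    return text.encode(fallback), fallback


# Korean text cannot be represented in latin-1, so utf-8 is chosen.
body, charset = encode_response("한국어", ["iso-8859-1", "utf-8"])
content_type = "text/html;charset=%s" % charset
print(content_type)  # text/html;charset=utf-8

# Plain Western European text fits the browser's first choice.
print(encode_response("café", ["iso-8859-1", "utf-8"])[1])  # iso-8859-1
```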



> If there are any documents, web pages, Zope 3 book chapters, and past
> messages that I may have missed or need to look at in more detail,
> please let me know. I've had a hard time sifting through all of the
> information, and I apologize if I've missed something written by
> anyone here.


I'm wondering if I make this clear enough in my book. It's always hard 
to tell by myself since these things seem obvious to me. If you got any 
constructive feedback regarding this, I'll be more than happy to hear it 
and consequently improve the book for you "Stupid Americans" :).


HTH

--
http://worldcookery.com -- Professional Zope documentation and training
Next Zope 3 training at Camp5: http://trizpug.org/boot-camp/camp5
