Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?
On 3/1/07, Paul Winkler <[EMAIL PROTECTED]> wrote:
> On Wed, Feb 28, 2007 at 09:08:03PM -0500, Gary Poster wrote:
> > It's been years since I dug into this, but I'm better than 90% sure
> > that the browser is expected to make its requests in the encoding of
> > the response (i.e., the one set by Content-Type). It's been too long
> > for me to tell you if that's in a spec or if it is simply the de
> > facto rule, though I suspect the former.
>
> That almost makes sense, except that the first request precedes the
> first response :) I'll have to dig into this some more when I have
> time...

By first request do you mean first form-submission? You have to do a
request to get the form. When the server sends the form, the HTTP
response containing the form should have a content type. If the form to
be submitted has an accept-charset attribute explicitly declared, that
should become the value of the Accept-Charset header. If that field is
absent, it's supposed to be understood as a special value, 'UNKNOWN',
which means that the browser or other user-agent may submit (I don't
remember if the spec says MAY or SHOULD, but I know it doesn't say MUST)
the form data in the same character set as the form's page.

I did a fair amount of spec reading and zope.publisher.http/browser
entrail reading yesterday, can you tell? :)

Anyways, without adding accept_charset to the form, this is what Firefox
sent on a form submission request's Accept-Charset header::

    ISO-8859-1,utf-8;q=0.7,*;q=0.7

Zope turned that into::

    ['utf-8', 'iso-8859-1', '*']

Zope gives UTF-8 priority over everything. The Accept-Charset header, if
present on the request, is used to establish the response character set
unless explicitly stated otherwise (or the response isn't text). So I
guess if my Firefox is sending that same accept-charset header to Zope
on each request, it will get a UTF-8 response every time (again, unless
explicitly made otherwise).
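
The header-to-list step above can be sketched in a few lines. This is only an illustration of the observed behaviour, not zope.publisher's actual code: the function name `parse_accept_charset` is hypothetical, and moving utf-8 to the front is an assumption based on the output reported above.

```python
def parse_accept_charset(header):
    """Return charset names from an Accept-Charset header, most preferred first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        quality = 1.0  # HTTP/1.1 default when no q parameter is given
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q value as "not acceptable"
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            entries.append((-quality, position, name))
    ordered = [name for _, _, name in sorted(entries)]
    # Assumption: mimic the reported behaviour of giving utf-8 priority.
    if "utf-8" in ordered:
        ordered.remove("utf-8")
        ordered.insert(0, "utf-8")
    return ordered


print(parse_accept_charset("ISO-8859-1,utf-8;q=0.7,*;q=0.7"))
# -> ['utf-8', 'iso-8859-1', '*']
```

The sketch only reorders charsets that the browser actually listed; whether the real publisher adds utf-8 when it is absent is not something this thread establishes.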
If it is supposed to submit POSTs in the same character set that it
received, then it should be sending UTF-8 each time. Hunh.

So if you had , then the browser should send only cp437 in the
Accept-Charset header and Zope should only try to decode from that
character set; and the succeeding response should be encoded in cp437 as
well. I think. That seems to be the best I can figure out between the
HTML 4.01 and HTTP 1.1 specs and zope.publisher's http/browser request
and response handlers.

It seems unlikely that you would ever need to use accept_charset like
this, though; at least not in Zope, which does a good job of doing all
of this encoding/decoding work.

Well, all of this is good to finally know. This has been a mysterious
black box to me for such a long time, and it turns out that I don't need
to worry about it.

The lessons I've learned for text, as they apply to my own code, are
thus:

- Work in unicode, not strings; then you won't have to worry about
  collisions between unicode and strings ('ab' + u'cdé') raising decode
  errors.

- When working with text, decode strings to unicode instead of encoding
  unicode to strings. I was forcibly **encoding** my unicode objects
  when I'd be building up long strings, which came from my confusion
  over encode/decode. This is how I'd lose my extended characters and
  end up with garbage output.

- Be alert to what other text processing tools, such as the Python
  implementations of Textile and Markdown, want as input and return as
  output. In my ignorance, I wasn't paying attention to the fact that I
  needed to decode the results back to unicode, and I believe this was
  another central point of pain, torture, and failure for my apps. And
  in my ignorance I tried to fix the errors that I saw with forcible
  *encoding* instead of *decoding*, which is why I would see garbage
  characters show up in certain situations.
I now realize this is the right way to work with those tools::

    rendered = textile(content.encode('utf-8'), encoding='utf-8', output='utf-8')
    return rendered.decode('utf-8')

Does that all sound right?

--
Jeff Shell
___
Zope3-users mailing list
Zope3-users@zope.org
http://mail.zope.org/mailman/listinfo/zope3-users
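
The "encode only at the tool boundary, decode the result straight away" pattern from the message above can be sketched as follows. This is a Python 3 illustration (where `str` is text and `bytes` is the old byte string); `render_bytes` is a hypothetical stand-in for a byte-oriented filter such as the Textile port mentioned, not its real API.

```python
def render_bytes(source: bytes) -> bytes:
    """Hypothetical stand-in for a text filter that only speaks bytes."""
    return b"<p>" + source + b"</p>"


def render_text(content: str, encoding: str = "utf-8") -> str:
    """Keep text as text everywhere except at the boundary of the byte tool."""
    rendered = render_bytes(content.encode(encoding))  # encode on the way in
    return rendered.decode(encoding)                   # decode on the way out


html = render_text("ab" + "cdé")
print(html)
# -> <p>abcdé</p>
```

The same codec is used in both directions, which is what keeps the extended characters intact.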
Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?
On Wed, Feb 28, 2007 at 09:08:03PM -0500, Gary Poster wrote:
> It's been years since I dug into this, but I'm better than 90% sure
> that the browser is expected to make its requests in the encoding of
> the response (i.e., the one set by Content-Type). It's been too long
> for me to tell you if that's in a spec or if it is simply the de
> facto rule, though I suspect the former.

That almost makes sense, except that the first request precedes the
first response :) I'll have to dig into this some more when I have
time...

--
Paul Winkler
http://www.slinkp.com
[Zope3-Users] Re: Unicode for Stupid Americans (like me)?
Jeff Shell wrote:
> [..]
> I feel like I know enough to squeak by, but that's no longer
> acceptable. Sometimes I quiver in terror, waiting for everything to
> fall down because of something so seemingly basic like strings/text.
> It may be a lot of technical debt, or it may be extremely easy to pay
> down. In any case, it's time to pay it down. :)

In addition to what has already been said (and maybe not really what you
are asking for), the number one problem people encounter can be
illustrated within a simple interactive Python session like this one:

[EMAIL PROTECTED] ~]$ python244
Python 2.4.4 (#1, Jan 29 2007, 13:00:46)
[GCC 4.1.1 20060525 (Red Hat 4.1.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s1 = u"I'm unicode"
>>> s2 = "I'm a byte string"
>>> s1 + s2
u"I'm unicodeI'm a byte string"
>>> s1 = u"I'm unicode äöü"
>>> s1
u"I'm unicode \xe4\xf6\xfc"
>>> s2 = "I'm a byte string äöü"
>>> s2
"I'm a byte string \xc3\xa4\xc3\xb6\xc3\xbc"
>>> s1 + s2
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)
>>> s1 + s2.decode('utf-8')
u"I'm unicode \xe4\xf6\xfcI'm a byte string \xe4\xf6\xfc"
>>>

The cause is Python's implicit casting of byte strings to unicode
strings the moment it has to deal with both, together with Python's
default encoding being 'ascii' (don't ask why - it was a long discussion
at the time ...).

When the strings involved come from some application call or third-party
code, you often don't really know what you will get, so unless you do
extensive checks and explicit decoding using the right codec you'll
never be on the safe side.

Within Zope 3, however, this should be a non-issue as Zope 3 (including
Zope 3-based applications) should only use unicode strings.

Just my 2 cents.

Raphael
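
For readers on Python 3, the same experiment looks slightly different, because `str` is now the unicode type and `bytes` is the byte string, and the two are never mixed implicitly. A minimal sketch (the variable names are illustrative only):

```python
text = "I'm unicode äöü"                         # text (Python 2: unicode)
data = "I'm a byte string äöü".encode("utf-8")   # bytes (Python 2: str)

# Python 2 attempted an implicit ASCII decode here; Python 3 refuses outright.
try:
    text + data
except TypeError as exc:
    print("mixing failed:", type(exc).__name__)

# Decoding with the wrong codec reproduces the error from the session above.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    failed_at = exc.start
    print("ascii decode failed at byte", failed_at)

# Explicit decoding with the right codec is what makes the join safe.
combined = text + data.decode("utf-8")
print(combined)
```

The failing byte offset is the same as in the Python 2.4 traceback, since the first non-ASCII byte sits at the same position in the UTF-8 data.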
Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?
On Feb 28, 2007, at 5:37 PM, Paul Winkler wrote:
> On Wed, Feb 28, 2007 at 02:06:39PM -0700, Jeff Shell wrote:
> > On 2/28/07, Philipp von Weitershausen <[EMAIL PROTECTED]> wrote:
> > > That's sorta what zope.publisher does. Actually, it figures that
> > > if the browser sends an Accept-Charset header, the stuff that it's
> > > sending to us would be encoded in one of those encodings, so it
> > > tries the ones in Accept-Charset until it's lucky. It falls back
> > > to UTF-8.
> > >
> > > This seems to work. But yeah, it's relying on implementation
> > > details of the browser and it's weird.
> >
> > Ugh. I don't know how I missed that header. I was always looking for
> > a content-type on the post, hoping that it had the information.
>
> I'm rather late to this particular party, and I'm far from an expert
> on either unicode or HTTP, but I have to ask: Is it just me, or is
> HTTP's support for specifying encodings completely inadequate?
>
> As far as I can tell, there are only two relevant headers. The request
> may specify Accept-Charset, whose meaning is given as "what character
> sets are acceptable for the *response*" (emphasis mine). The response
> may specify Content-Type, which again is irrelevant to the request.
>
> If there's anything that allows the client to specify the encoding in
> use *for the request data*, I don't see it. That seems like quite an
> oversight to make as late as HTTP 1.1 (1999). What am I missing?

It's been years since I dug into this, but I'm better than 90% sure that
the browser is expected to make its requests in the encoding of the
response (i.e., the one set by Content-Type). It's been too long for me
to tell you if that's in a spec or if it is simply the de facto rule,
though I suspect the former.

Gary
Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?
On Wed, Feb 28, 2007 at 02:06:39PM -0700, Jeff Shell wrote:
> On 2/28/07, Philipp von Weitershausen <[EMAIL PROTECTED]> wrote:
> > That's sorta what zope.publisher does. Actually, it figures that if
> > the browser sends an Accept-Charset header, the stuff that it's
> > sending to us would be encoded in one of those encodings, so it
> > tries the ones in Accept-Charset until it's lucky. It falls back to
> > UTF-8.
> >
> > This seems to work. But yeah, it's relying on implementation details
> > of the browser and it's weird.
>
> Ugh. I don't know how I missed that header. I was always looking for a
> content-type on the post, hoping that it had the information.

I'm rather late to this particular party, and I'm far from an expert on
either unicode or HTTP, but I have to ask: Is it just me, or is HTTP's
support for specifying encodings completely inadequate?

As far as I can tell, there are only two relevant headers. The request
may specify Accept-Charset, whose meaning is given as "what character
sets are acceptable for the *response*" (emphasis mine). The response
may specify Content-Type, which again is irrelevant to the request.

If there's anything that allows the client to specify the encoding in
use *for the request data*, I don't see it. That seems like quite an
oversight to make as late as HTTP 1.1 (1999). What am I missing?

-PW

--
Paul Winkler
http://www.slinkp.com
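
The "try the charsets from Accept-Charset until one works, then fall back to UTF-8" heuristic quoted above can be sketched like this. It is an illustration of the idea only, not zope.publisher's implementation; the function name and the final replace-errors step are assumptions.

```python
def decode_request_data(data, charsets, fallback="utf-8"):
    """Try each candidate charset in order; return (text, charset_used)."""
    for charset in list(charsets) + [fallback]:
        if charset == "*":
            continue  # the wildcard names no concrete codec
        try:
            return data.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Assumption: as a last resort, decode with replacement characters.
    return data.decode(fallback, errors="replace"), fallback


candidates = ["utf-8", "iso-8859-1", "*"]
print(decode_request_data("café".encode("utf-8"), candidates))
# -> ('café', 'utf-8')
print(decode_request_data("café".encode("iso-8859-1"), candidates))
# -> ('café', 'iso-8859-1')
```

Order matters in a scheme like this: iso-8859-1 can decode any byte sequence, so once it is tried nothing after it is ever reached, and a wrong guess produces garbage rather than an error. That is one way to read the "it's weird" remark above.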
Re: [Zope3-Users] Re: Unicode for Stupid Americans (like me)?
On 2/28/07, Philipp von Weitershausen <[EMAIL PROTECTED]> wrote:
> Jeff Shell wrote:
> > - Not have any encode / decode errors. 'ascii codec doesn't
> >   recognize character ... at position ...'. I don't want to keep on
> >   bullying through whenever this pops up.
>
> You can't just simply do str(some_unicode) or unicode(some_str),
> unless you really know that you're only dealing with the ASCII subset
> in both cases. Use explicit encodings to convert.
>
> Now, the trick is obviously to know the encoding. A 'str' object is
> worth squat if you don't know the encoding that goes along with it. In
> other words, (some_str, encoding) is isomorphic to a unicode object.

Ahh. I finally get this now. I was casting back and forth with wild
abandon in some key places - in one particular place I was doing wild
encoding somersaults when I really meant to be doing a small set of
decode tries. I think this is why I was seeing customer garbage: I was
turning unicode into strs and back again long before the final response
was all built up.

> > - HOW do I know what a browser has sent me? There doesn't seem to be
> >   a real way of handling this. Do I guess?
>
> That's sorta what zope.publisher does. Actually, it figures that if
> the browser sends an Accept-Charset header, the stuff that it's
> sending to us would be encoded in one of those encodings, so it tries
> the ones in Accept-Charset until it's lucky. It falls back to UTF-8.
>
> This seems to work. But yeah, it's relying on implementation details
> of the browser and it's weird.

Ugh. I don't know how I missed that header. I was always looking for a
content-type on the post, hoping that it had the information. I was
finally able to confirm that Zope was handing me the data properly; it
was some of my HTML generation code that was mangling data on output.

> > But again, how do I know when to decode from latin-1 and when to
> > decode from UTF-8? When or why should I encode to one or the other
> > at response time? Should I worry at all?
> If you're using Zope, you don't have to encode outgoing text at all,
> unless you're setting a non-text content-type on the outgoing
> response. If the content-type is text/*, you can just return unicode
> from your browser view and zope.publisher will use the best encoding
> that the browser prefers (from Accept-Charset). "Best" meaning that if
> the browser accepts latin-1,utf-8 and your page contains Korean text,
> it'll use utf-8, not latin-1. utf-8 is always a fallback, anyway, so
> that there's no chance to not be able to encode.

This finally made sense to me as well. I had a form with a widget
rendered by my own HTML generation code, and with a zope.app.widgets
text field. I pasted Sam Ruby's "Internationalization" diacritic-heavy
string into both fields. When I saw that the zope.app.widget was
rendering properly while my own field was not, that sealed it.
Unfortunately, all of my prior tests had involved my own widget, since
that is where I had seen the junk characters.

Now I ensure that my HTML generator is all unicode. Any basic string
that it encounters, which typically comes from source code, is decoded
into unicode immediately. As mentioned above, I was wildly and
inappropriately encoding to strings with some forceful settings so that
I could join elements together.

> I'm wondering if I make this clear enough in my book. It's always hard
> to tell by myself since these things seem obvious to me. If you've got
> any constructive feedback regarding this, I'll be more than happy to
> hear it and consequently improve the book for you "Stupid Americans"
> :).

At quick glance, I didn't see where this might have been described.
There's no mention of unicode in the back index, and from the table of
contents I didn't see much besides the chapter on internationalization
(which we're completely avoiding until we absolutely need to do it).

But this helps. Between all of the answers I've received thus far, I
finally have a grasp of what I'm doing. I'll try to codify it into a
useful document.
Thanks!

--
Jeff Shell
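
The "decode any byte string immediately, join only text" rule described for the HTML generator might look like the sketch below. It is a Python 3 illustration with hypothetical helper names (`to_text`, `tag`), it assumes byte strings from source code are UTF-8, and it deliberately leaves out HTML escaping.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings the moment they arrive."""
    if isinstance(value, bytes):
        # Assumption: byte strings originating in source files are UTF-8.
        return value.decode(encoding)
    return str(value)


def tag(name, *children):
    """Build an element from text only; nothing is encoded at this stage."""
    inner = "".join(to_text(child) for child in children)
    return "<{0}>{1}</{0}>".format(name, inner)


print(tag("p", b"caf\xc3\xa9", " ", "äöü"))
# -> <p>café äöü</p>
```

Encoding then happens exactly once, when the finished page leaves the application, which in Zope's case is the publisher's job.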
[Zope3-Users] Re: Unicode for Stupid Americans (like me)?
Jeff Shell wrote:
> I continue to feel like an idiot in the face of Unicode. I finally
> understand what a unicode 'string' really is, and what encode and
> decode mean (they were previously interchangeable in my mind). But I
> don't know the best practices.
>
> My desire is to:
>
> - Not have any encode / decode errors. 'ascii codec doesn't recognize
>   character ... at position ...'. I don't want to keep on bullying
>   through whenever this pops up.

You can't just simply do str(some_unicode) or unicode(some_str), unless
you really know that you're only dealing with the ASCII subset in both
cases. Use explicit encodings to convert.

Now, the trick is obviously to know the encoding. A 'str' object is
worth squat if you don't know the encoding that goes along with it. In
other words, (some_str, encoding) is isomorphic to a unicode object.

> - Not turn customer input into garbage. It may render to the public
>   site fine, but sometimes in the admin skin's text areas, things turn
>   funky. I don't know if there's something I need to do at
>   form-handling time, or at rendering time, or what... I did a test
>   based on a document by Sam Ruby, and guess that I'm often getting
>   Latin-1 from our customers, which doesn't map to UTF-8 (the
>   diacritic marks go haywire).
>
> - HOW do I know what a browser has sent me? There doesn't seem to be a
>   real way of handling this. Do I guess?

That's sorta what zope.publisher does. Actually, it figures that if the
browser sends an Accept-Charset header, the stuff that it's sending to
us would be encoded in one of those encodings, so it tries the ones in
Accept-Charset until it's lucky. It falls back to UTF-8.

This seems to work. But yeah, it's relying on implementation details of
the browser and it's weird.

> - Know without a doubt when to encode, and when to decode. I guess the
>   "proper" thing to do is to store everything as unicode, and to
>   decode to unicode as early as possible when input is coming in.

Absolutely correct.
> But again, how do I know when to decode from latin-1 and when to
> decode from UTF-8? When or why should I encode to one or the other at
> response time? Should I worry at all?

If you're using Zope, you don't have to encode outgoing text at all,
unless you're setting a non-text content-type on the outgoing response.
If the content-type is text/*, you can just return unicode from your
browser view and zope.publisher will use the best encoding that the
browser prefers (from Accept-Charset). "Best" meaning that if the
browser accepts latin-1,utf-8 and your page contains Korean text, it'll
use utf-8, not latin-1. utf-8 is always a fallback, anyway, so that
there's no chance to not be able to encode.

You can, of course, encode yourself in the browser view. You can pick
pretty much any encoding you like; all you have to do is tell the
browser about it in the response header (Content-Type:
foo/bar;charset=your-encoding).

> If there are any documents, web pages, Zope 3 book chapters, and past
> messages that I may have missed or need to look at in more detail,
> please let me know. I've had a hard time sifting through all of the
> information, and I apologize if I've missed something written by
> anyone here.

I'm wondering if I make this clear enough in my book. It's always hard
to tell by myself since these things seem obvious to me. If you've got
any constructive feedback regarding this, I'll be more than happy to
hear it and consequently improve the book for you "Stupid Americans" :).

HTH

--
http://worldcookery.com -- Professional Zope documentation and training
Next Zope 3 training at Camp5: http://trizpug.org/boot-camp/camp5
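
The "best encoding the browser prefers, with utf-8 as the fallback" rule described above can be sketched as follows. This illustrates the selection logic only and is not zope.publisher's code; the function names and the `text/html` content type are assumptions for the example.

```python
def choose_response_charset(text, accepted, fallback="utf-8"):
    """Pick the first accepted charset that can represent the whole text."""
    for charset in accepted:
        if charset == "*":
            continue  # the wildcard names no concrete codec
        try:
            text.encode(charset)
        except (UnicodeEncodeError, LookupError):
            continue  # cannot represent this text, or unknown codec name
        return charset
    return fallback  # utf-8 can encode any text, so this never fails


def encode_response(text, accepted):
    """Return the encoded body plus the Content-Type value announcing it."""
    charset = choose_response_charset(text, accepted)
    return text.encode(charset), "text/html;charset=" + charset


print(choose_response_charset("café", ["iso-8859-1", "utf-8"]))
# -> iso-8859-1
print(choose_response_charset("한국어", ["iso-8859-1", "utf-8"]))
# -> utf-8
```

The second call mirrors the Korean-text example: latin-1 is listed first but cannot encode the page, so utf-8 is used instead.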