On Saturday, July 17, 2010, Graham Dumpleton <graham.dumple...@gmail.com> wrote:
> On Saturday, July 17, 2010, Ian Bicking <i...@colorstudy.com> wrote:
>> On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby <p...@telecommunity.com> wrote:
>>
>> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>>
>> And this doesn't help with Python 3: either we have byte values of
>> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values. I think
>> bytes will be more awkward to port to than text, and inconsistent with other
>> WSGI values.
>>
>> OTOH, it has the tremendous advantage of pushing the encoding question onto
>> the app (or framework) developer... who's really the only one who can make
>> the right decision for their particular application. And personally, I'd
>> rather have clear boundaries between text and bytes, such that porting (even
>> if tedious or awkward) is *consistent*, and clear as to when you're
>> finished, not, "oh, did I check to make sure I converted SCRIPT_NAME and
>> PATH_INFO... not just in my app code, but in all the library code I call
>> *from* my app?"
>>
>> IOW, the bytes/string discussion on Python-dev has kind of led me to realize
>> that we might just as well make the *entire* stack bytes (incoming and
>> outgoing headers *and* streams), and rewrite that bit in PEP 333 about using
>> str on "Python 3000" to say we go with bytes on Python 3+ for everything
>> that's a str in today's WSGI.
>>
>> This was my first intuition too, until I started thinking in more detail
>> about the particular values involved. Some obviously are textish, like
>> environ['SERVER_NAME']. Not a very useful value, but definitely text.
>>
>> Basically all the internal strings are textish, so we're left with:
>>
>> wsgi.url_scheme
>> SCRIPT_NAME/PATH_INFO
>> QUERY_STRING
>> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
>> response status
>> response headers (name and value)
>>
>> And there's a few things like REMOTE_USER that are kind of in the middle.
>> Everyone is in agreement that bodies should be bytes.
>>
>> One initial problem is that the Python 3 stdlib handles bytes poorly, so for
>> instance there's no good way to reconstruct the URL using the stdlib. That
>> explains certain tensions, but I think we should ignore that, and in fact
>> that's what Python-Dev seemed to say pretty clearly.
>>
>> Now, the other keys:
>>
>> wsgi.url_scheme: clearly ASCII
>>
>> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old
>> legacy encoding.
>>
>> raw request path: should be ASCII (non-ASCII should be URL-encoded). URL
>> encoding happens at the byte layer, so a server could reasonably URL encode
>> any non-ASCII characters without imposing any encoding.
>>
>> QUERY_STRING: should be ASCII, same as raw request path
>>
>> headers: Most are ASCII. Latin1 is a reasonable fallback and suggested by
>> the specification. The spec also implies you have to use the RFC2047 inline
>> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and
>> supporting it would probably be a bad idea for security reasons. The
>> Atompub spec (reasonably modern) specifically says Title headers should be
>> encoded with RFC2047 (if they are not ISO-8859-1):
>> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 --
>> decoding this kind of encoding at the application layer seems reasonable to
>> me.
>>
>> cookie header: this specific header can easily have multiple encodings, as
>> the browser encodes data then treats it as opaque bytes, so a cookie can be
>> set via UTF-8 one place, Latin1 another, and those coexist in one header.
>> That is, there is no real encoding and this should be treated as bytes.
>> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but
>> entirely workable.)
>>
>> response status: I believe the spec says this must be Latin1/ISO-8859-1.
>> In practice it is almost always ASCII, and since it is not user-visible it's
>> not something that really needs localization.
>>
>> response headers: the spec implies Latin1, in practice the Set-Cookie header
>> is bytes (since interoperation with wonky legacy systems is not uncommon).
>> I'm not sure of any other exceptions?
>>
>> So... to me it seems pretty reasonable for HTTP specifically that text can
>> work. And it feels weird that, say, environ['SERVER_NAME'] be text and
>> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR']
>> should be in that mode. And it would also be weird if
>> environ['SERVER_NAME'] was bytes.
>>
>> In the past when we've gotten down to specifics, the only holdup has been
>> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.
>
> There were a few other weird ones which are though server specific.
> For example PATH_TRANSLATED (??). These are ones where again the
> server or operating system dictates the encoding due to them having
> bits in them deriving from things like filesystem paths and server
> configuration files. I laboriously went through all these in an email
> last year or earlier.
>
> Same reason why SCRIPT_NAME is really dictated by server and raw value
> perhaps should be going through to application.
s/should/shouldn't/

Graham
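
For anyone following the thread who wants to see the two byte-level points from Ian's quoted message in concrete form ("Latin1 is an approximation of bytes" and "URL encoding happens at the byte layer"), here is a minimal sketch. It assumes a hypothetical server that hands the application text values decoded as Latin-1; the helper names and the sample environ are made up for illustration and are not part of any WSGI spec or server.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def text_to_wire_bytes(value: str) -> bytes:
    """Recover the original bytes from a value that was decoded as Latin-1."""
    return value.encode("latin-1")


def decode_path(path_info: str, encoding: str = "utf-8") -> str:
    """Re-decode a Latin-1 'text' path with the encoding the app expects."""
    return text_to_wire_bytes(path_info).decode(encoding)


def quote_raw_path(raw: bytes) -> str:
    """Percent-encode non-ASCII bytes without assuming any character encoding."""
    return quote_from_bytes(raw, safe="/")


# Hypothetical request: the client sent the UTF-8 bytes for "/café".
wire = "/café".encode("utf-8")

# Assumption: the server exposes PATH_INFO as text by decoding with Latin-1.
environ = {"PATH_INFO": wire.decode("latin-1")}

# Latin-1 maps every byte 0-255 to one code point, so nothing is lost.
assert text_to_wire_bytes(environ["PATH_INFO"]) == wire

# The application, which knows its own URLs are UTF-8, picks the encoding.
assert decode_path(environ["PATH_INFO"]) == "/café"

# Percent-encoding works on bytes, so the result is pure ASCII either way.
assert quote_raw_path(wire) == "/caf%C3%A9"
assert unquote_to_bytes("/caf%C3%A9") == wire
```

The same round trip holds for a cookie header that mixes encodings: the Latin-1 step never fails and never changes a byte, so the decision about what each piece really means stays with the application.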