On Saturday, July 17, 2010, Graham Dumpleton <graham.dumple...@gmail.com> wrote:
> On Saturday, July 17, 2010, Ian Bicking <i...@colorstudy.com> wrote:
>> On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby <p...@telecommunity.com> wrote:
>>
>>
>> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>>
>> And this doesn't help with Python 3: either we have byte values of 
>> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think 
>> bytes will be more awkward to port to than text, and inconsistent with other 
>> WSGI values.
>>
>>
>> OTOH, it has the tremendous advantage of pushing the encoding question onto 
>> the app (or framework) developer...  who's really the only one who can make 
>> the right decision for their particular application.  And personally, I'd 
>> rather have clear boundaries between text and bytes, such that porting (even 
>> if tedious or awkward) is *consistent*, and clear as to when you're 
>> finished, not, "oh, did I check to make sure I converted SCRIPT_NAME and 
>> PATH_INFO...  not just in my app code, but in all the library code I call 
>> *from* my app?"
>>
>> IOW, the bytes/string discussion on Python-dev has kind of led me to realize 
>> that we might just as well make the *entire* stack bytes (incoming and 
>> outgoing headers *and* streams), and rewrite that bit in PEP 333 about using 
>> str on "Python 3000" to say we go with bytes on Python 3+ for everything 
>> that's a str in today's WSGI.
>>
>> This was my first intuition too, until I started thinking in more detail 
>> about the particular values involved.  Some obviously are textish, like 
>> environ['SERVER_NAME'].  Not a very useful value, but definitely text.
>>
>> Basically all the internal strings are textish, so we're left with:
>>
>> wsgi.url_scheme
>> SCRIPT_NAME/PATH_INFO
>> QUERY_STRING
>> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
>> response status
>> response headers (name and value)
>>
>> And there's a few things like REMOTE_USER that are kind of in the middle.  
>> Everyone is in agreement that bodies should be bytes.
>>
>> One initial problem is that the Python 3 stdlib handles bytes poorly, so for 
>> instance there's no good way to reconstruct the URL using the stdlib.  That 
>> explains certain tensions, but I think we should ignore that, and in fact 
>> that's what Python-Dev seemed to say pretty clearly.
>>
>> Now, the other keys:
>>
>> wsgi.url_scheme: clearly ASCII
>>
>> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old 
>> legacy encoding.
>> raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL 
>> encoding happens at the byte layer, so a server could reasonably URL encode 
>> any non-ASCII characters without imposing any encoding.
>>
>> QUERY_STRING: should be ASCII, same as raw request path
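
To illustrate the point that URL encoding happens at the byte layer, here is a minimal sketch using the Python 3 stdlib functions urllib.parse.quote_from_bytes and unquote_to_bytes. The sample paths are hypothetical; the two values hold the same character in two different encodings to show that the server never has to pick one.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical raw path bytes as a server might see them. The same
# character is encoded two different ways on purpose.
raw_utf8 = "/caf\u00e9".encode("utf-8")      # b'/caf\xc3\xa9'
raw_latin1 = "/caf\u00e9".encode("latin-1")  # b'/caf\xe9'

# Percent-encoding works on bytes, so no text encoding is assumed or imposed.
quoted_utf8 = quote_from_bytes(raw_utf8)      # '/caf%C3%A9'
quoted_latin1 = quote_from_bytes(raw_latin1)  # '/caf%E9'

# The results are pure ASCII, and the original bytes are recoverable.
restored_utf8 = unquote_to_bytes(quoted_utf8)
restored_latin1 = unquote_to_bytes(quoted_latin1)
```

Whether the application then decodes the restored bytes as UTF-8, Latin1, or not at all remains its own decision.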
>>
>> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by 
>> the specification.  The spec also implies you have to use the RFC2047 inline 
>> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and 
>> supporting it would probably be a bad idea for security reasons.  The 
>> Atompub spec (reasonably modern) specifically says Title headers should be 
>> encoded with RFC2047 (if they are not ISO-8859-1): 
>> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 -- 
>> decoding this kind of encoding at the application layer seems reasonable to 
>> me.
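
As a rough sketch of what decoding RFC2047 at the application layer could look like, the stdlib's email.header.decode_header can be used. The helper name decode_rfc2047 and the Latin1 fallback for unlabelled chunks are assumptions made for illustration, not something WSGI or any server prescribes.

```python
from email.header import decode_header


def decode_rfc2047(value):
    """Decode RFC 2047 encoded-words in a header value (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Assumption: chunks without a declared charset are Latin1.
            parts.append(chunk.decode(charset or "latin-1"))
        else:
            # Values with no encoded-words come back as str unchanged.
            parts.append(chunk)
    return "".join(parts)


title = decode_rfc2047("=?iso-8859-1?q?some=20text?=")
plain = decode_rfc2047("plain title")
```

An application would call this only on the headers where it expects such encoding (the Atompub Title header, say), which keeps the security exposure limited to code that opted in.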
>>
>> cookie header: this specific header can easily have multiple encodings, as 
>> the browser encodes data then treats it as opaque bytes, so a cookie can be 
>> set via UTF-8 one place, Latin1 another, and those coexist in one header.  
>> That is, there is no real encoding and this should be treated as bytes.  
>> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but 
>> entirely workable.)
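
The "Latin1 is an approximation of bytes" idea can be sketched as follows. The cookie names and values are made up; the point is only that a Latin1 decode never fails and never loses information, so each value can later be re-encoded and decoded with whatever encoding the application knows it used.

```python
# Hypothetical Cookie header holding two values set with different encodings.
raw = (
    b"a=" + "caf\u00e9".encode("utf-8")
    + b"; b=" + "caf\u00e9".encode("latin-1")
)

# Latin1 maps every byte 0-255 to the code point with the same number,
# so this decode cannot fail and is fully reversible.
as_text = raw.decode("latin-1")
round_trip = as_text.encode("latin-1")

# Split on the text form, then recover each value with its own encoding.
pairs = dict(item.split("=", 1) for item in as_text.split("; "))
a = pairs["a"].encode("latin-1").decode("utf-8")
b = pairs["b"].encode("latin-1").decode("latin-1")
```

The intermediate text for cookie "a" is mojibake until it is re-encoded, which is the "spotty" part of treating bytes this way.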
>>
>> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In 
>> practice it is almost always ASCII, and since it is not user-visible it's 
>> not something that really needs localization.
>>
>> response headers: the spec implies Latin1, in practice the Set-Cookie header 
>> is bytes (since interoperation with wonky legacy systems is not uncommon).  
>> I'm not sure of any other exceptions?
>>
>>
>> So... to me it seems pretty reasonable for HTTP specifically that text can 
>> work.  And it feels weird that, say, environ['SERVER_NAME'] be text and 
>> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR'] 
>> should be in that mode.  And it would also be weird if 
>> environ['SERVER_NAME'] was bytes.
>>
>> In the past when we've gotten down to specifics, the only holdup has been 
>> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.
>
> There were a few other weird ones, though those are server specific.
> For example PATH_TRANSLATED (??). These are ones where again the
> server or operating system dictates the encoding due to them having
> bits in them deriving from things like filesystem paths and server
> configuration files. I laboriously went through all these in an email
> last year or earlier.
>
> Same reason why SCRIPT_NAME is really dictated by the server and the raw
> value perhaps should be going through to the application.

s/should/shouldn't/

Graham
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
