Re: [Web-SIG] WSGI, Python 3 and Unicode
[Alan] The restriction to iso-8859-1 is really a distraction; iso-8859-1 is used simply as an identity encoding that also enforces that all bytes in the string have a value from 0x00 to 0xff, so that they are suitable for byte-oriented IO. So, in output terms at least, WSGI *is* a byte-oriented protocol. The problem is that python-the-language didn't have support for bytes at the time WSGI was designed.

[Thomas] If you're talking about the output stream, then yes, it's all about bytes (or should be).

Indeed, I was only talking about output, specifically the response body.

[Thomas] But at the status and headers level, HTTP/1.1 is fundamentally ISO-8859-1-encoded.

Agreed. That is why the WSGI spec also states

  Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.

So in order to use non-ISO-8859-1 characters in response status strings or headers, you must use RFC 2047. As confirmed by the links you posted, this is an HTTP restriction, not a WSGI restriction.

Regards,

Alan.
___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
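The header rule quoted from the spec can be sketched in Python 3 roughly as follows. This is an illustration under assumptions, not something the thread or the WSGI spec prescribes: the helper name `header_value_for_wsgi` is hypothetical, and using the standard library's `email.header.Header` to produce the RFC 2047 form is just one possible choice.

```python
from email.header import Header, decode_header


def header_value_for_wsgi(value):
    """Return a header value that satisfies the ISO-8859-1 rule.

    Hypothetical helper: values that already fit in ISO-8859-1 are
    returned unchanged; anything else is wrapped as an RFC 2047
    encoded-word using UTF-8.
    """
    try:
        value.encode('iso-8859-1')
        return value
    except UnicodeEncodeError:
        return Header(value, 'utf-8').encode()


plain = header_value_for_wsgi('caf\u00e9')      # fits in ISO-8859-1
wrapped = header_value_for_wsgi('IFN-\u03b3')   # Greek gamma does not fit

# The wrapped form decodes back to the original text.
raw, charset = decode_header(wrapped)[0]
restored = raw.decode(charset)
```

Other messages in this thread point out that few HTTP implementations actually decode RFC 2047, so the wrapped form may not be understood by clients.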
Re: [Web-SIG] WSGI, Python 3 and Unicode
[Phillip] WSGI already copes, actually. Note that Jython and IronPython have this issue today, and see: http://www.python.org/dev/peps/pep-0333/#unicode-issues

[James] It would seem very odd, however, for WSGI/python3 to use strings-restricted-to-0xFF for network I/O while everywhere else in python3 is going to use bytes for the same purpose.

I think it's worth pointing out the reason for the current restriction to iso-8859-1 is *because* python did not have a bytes type at the time the WSGI spec was drawn up. IIRC, the bytes type had not yet even been proposed for Py3K. Cpython effectively held all byte sequences as strings, a paradigm which is (still) followed by jython (not sure about ironpython).

The restriction to iso-8859-1 is really a distraction; iso-8859-1 is used simply as an identity encoding that also enforces that all bytes in the string have a value from 0x00 to 0xff, so that they are suitable for byte-oriented IO. So, in output terms at least, WSGI *is* a byte-oriented protocol. The problem is that python-the-language didn't have support for bytes at the time WSGI was designed.

[James] You'd have to modify your app to call write(unicodetext.encode('utf-8').decode('latin-1')) or so

Did you mean: write(unicodetext.encode('utf-8').encode('latin-1'))?

Either way, the second encode is not required; write(unicodetext.encode('utf-8')) is sufficient, since it will generate a byte-sequence(string) which will (actually should: see (*) note below) pass the following test.

    try:
        wsgi_response_data.encode('iso-8859-1')
    except UnicodeError:
        # Illegal WSGI response data!
        raise

On a side note, it's worth noting that Philip Jenvey's excellent rework of the jython IO subsystem to use java.nio is fundamentally byte oriented.

http://www.nabble.com/fileno-support-is-not-in-jython.-Reason--t4750734.html
http://fisheye3.cenqua.com/browse/jython/trunk/jython/src/org/python/core/io

Because it is based on the new IO design for Python 3K, as described in PEP 3116

http://www.python.org/dev/peps/pep-3116/

Regards,

Alan.

[*] Although I notice that cpython 2.5, for a reason I don't fully understand, fails this particular encoding sequence. (Maybe it's to do with the possibility that the result of an encode operation is no longer an encodable string?)

    Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
    >>> response.encode('utf-8').encode('latin-1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 22: ordinal not in range(128)

Meaning that to enforce the WSGI iso-8859-1 convention on cpython 2.5, you would have to carry out this rigmarole

    >>> response.encode('utf-8').decode('latin-1').encode('latin-1')
    'interferon-gamma (IFN-\xce\xb3) responses in cattle'

Perhaps this behaviour is an artifact of the cpython implementation? Whereas jython passes it just fine (and correctly, IMHO)

    Jython 2.2.1 on java1.4.2_15
    Type "copyright", "credits" or "license" for more information.
    >>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
    >>> response.encode('utf-8')
    'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
    >>> response.encode('utf-8').encode('latin-1')
    'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
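For readers on Python 3, where `bytes` and `str` are distinct types, the same check can be sketched as below. The CPython 2.5 traceback is consistent with Python 2 semantics, where calling `.encode()` on a byte string first decodes it implicitly with the ASCII codec. The function name `is_legal_wsgi_text` is an assumption made for illustration only.

```python
def is_legal_wsgi_text(data):
    """Return True if every character of a str is in the range 0x00-0xff."""
    try:
        data.encode('iso-8859-1')
        return True
    except UnicodeEncodeError:
        return False


response = "interferon-gamma (IFN-\u03b3) responses in cattle"

utf8_bytes = response.encode('utf-8')          # real bytes in Python 3
as_latin1_text = utf8_bytes.decode('latin-1')  # one character per byte

# latin-1 acts as an identity mapping between byte values and code points,
# so encoding again gives back exactly the UTF-8 bytes.
round_tripped = as_latin1_text.encode('latin-1')
```

Here `response` itself fails the check because of the Greek gamma, while `as_latin1_text` passes it, which is the Python 3 counterpart of the encode/decode/encode sequence shown for CPython 2.5.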
Re: [Web-SIG] WSGI, Python 3 and Unicode
Phillip J. Eby wrote:
> So here are my recommendations so far for the addendum to WSGI *1.0* for Python 3.0 (I expect we can be more strict for WSGI 2.0):
>
> * When running under Python 3, applications SHOULD produce bytes output and headers
>
> * When running under Python 3, servers and gateways MUST accept strings as application output or headers, under the existing rules (i.e., s.encode('latin-1') must convert the string to bytes without an exception)
>
> * When running under Python 3, servers MUST provide CGI HTTP variables as strings, decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI variables that should NOT be strings?)

I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not latin1. That is, after you urldecode the values (as WSGI asks you to do) proper conversion to text is to decode it as UTF8.

I'm a bit confused on how HTTP_COOKIE gets encoded. And QUERY_STRING also confuses me.

Is this all compatible with os.environ in py3k? I don't care that much if it does, but as the starting point for CGI it would be interesting if it stays in sync.

-- 
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
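Ian's suggestion for SCRIPT_NAME/PATH_INFO can be illustrated with a small Python 3 sketch: percent-decode to bytes first, then decode those bytes as UTF-8. The sample path is made up for illustration, and `urllib.parse.unquote_to_bytes` is used here only as a convenient standard-library way to perform the urldecode step.

```python
from urllib.parse import unquote_to_bytes

raw_path = '/caf%C3%A9/IFN-%CE%B3'   # hypothetical request path

path_bytes = unquote_to_bytes(raw_path)   # percent-decoding yields bytes
path_text = path_bytes.decode('utf-8')    # assumes the client used UTF-8
```

The second step is where the assumption lives: it only works when the bytes behind the percent-escapes really are UTF-8.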
Re: [Web-SIG] WSGI, Python 3 and Unicode
So here are my recommendations so far for the addendum to WSGI *1.0* for Python 3.0 (I expect we can be more strict for WSGI 2.0):

* When running under Python 3, applications SHOULD produce bytes output and headers

* When running under Python 3, servers and gateways MUST accept strings as application output or headers, under the existing rules (i.e., s.encode('latin-1') must convert the string to bytes without an exception)

* When running under Python 3, servers MUST provide CGI HTTP variables as strings, decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI variables that should NOT be strings?)

* When running under Python 3, servers MUST make wsgi.input a binary (byte) stream

* When running under Python 3, servers MUST provide a text stream for wsgi.errors

These rules are intended to simplify the porting of existing code. Notice, for example, that these rules allow middleware to pass strings through unchanged, since they are not required to produce bytes output or headers.

Unfortunately, wsgi.input can't be coded around, but for most frameworks this should be a single point of pain. In fact, if the 'cgi' stdlib module is made compatible with bytes, only the rare framework that rolls its own multipart parser or otherwise directly manipulates put/post data will be affected. Code that just takes the input and writes it to a file won't be bothered, either.

Comments or questions?
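A rough sketch of how a Python 3 gateway might apply these rules is given below. It is only an illustration: the names `to_wire_bytes` and `make_environ` are hypothetical, the environ is deliberately incomplete, and the RFC 2047 part of the header rule is left out.

```python
import io


def to_wire_bytes(chunk):
    """Normalize application output to bytes (hypothetical helper)."""
    if isinstance(chunk, bytes):
        return chunk
    # Existing rule: a str must encode to latin-1 without an exception.
    return chunk.encode('latin-1')


def make_environ(body, raw_headers):
    """Build a minimal, incomplete environ dict for illustration only."""
    environ = {
        'wsgi.input': io.BytesIO(body),   # binary (byte) stream
        'wsgi.errors': io.StringIO(),     # text stream
    }
    for name, value in raw_headers:
        key = 'HTTP_' + name.decode('latin-1').upper().replace('-', '_')
        environ[key] = value.decode('latin-1')   # header bytes become str
    return environ


environ = make_environ(b'a=1', [(b'X-Demo', b'caf\xe9')])
```

Because `to_wire_bytes` accepts both types, middleware written for strings can pass them through unchanged, while a string containing characters above 0xff still fails with `UnicodeEncodeError`.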
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 7, 2007, at 5:46 PM, Andrew Clover wrote:
> OTOH making the dictionaries reflect the underlying OS's conception of environment variables means users of os.environ and WSGI will have to be able to cope with both bytes and unicode, which would also be a big annoyance.
>
> In summary: urgh, this is all messy and 'orrible.

I suppose this is more a question for python-dev, but, it'd be really nice if Python on Windows made it look like the windows system encoding was always UTF-8. That is, bytestrings used for open/os.environ/argv/etc. are always encoded/decoded in utf-8, not the broken-platform-encoding. Then the same code would work just as well on unix as it does on windows.

Actually, I bet I could implement that today, just by wrapping some stuff... hmmm...

James
Re: [Web-SIG] WSGI, Python 3 and Unicode
James Y Knight wrote:
> In addition, I know of nobody who actually implements RFC 2047 decoding of http header values...nothing really uses it. (of course I don't know of all implementations out there.)

Certainly no browser supports it, which makes the point moot for WSGI. Most browsers, when quoting a header parameter, simply encode using the previous page's charset and put quotes around it... even if the parameter has a quote or control codes in it.

Ian wrote:
> Is this all compatible with os.environ in py3k?

In 3.0a2 os.environ has Unicode strings for both keys and values. This is correct for Windows where environment variables are explicitly Unicode, but questionable (IMO) for Unix where they're really bytes that may or may not represent decodable Unicode strings.

> SCRIPT_NAME/PATH_INFO

This already causes problems in Windows CGI applications! Because these are passed in environment variables, IIS* has to decode the submitted bytes to Unicode first. It seems always to choose UTF-8 for this job, which I suppose is the least bad guess, but hardly infallible.

(* - haven't tested this with Apache for Windows yet.)

In Python 2.x, os.environ being byte strings, Python/the C library then has to encode them back to bytes, which I believe ends up using the system codepage. Since the system codepage is never UTF-8 on Windows this means not only that the bytes read back from e.g. PATH_INFO are not the same as the original bytes submitted to the web server, but that if there are characters outside the system codepage submitted, they'll be unrecoverable.

If os.environ remains Unicode in Unix and WSGI follows it (as it must if CGI-invoked WSGI is to continue working smoothly), webapps that try to allow for non-ASCII characters in URLs are likely to get some nasty deployment problems that depend on the system encoding setting, something that will be particularly troublesome for end-users to debug and fix.

OTOH making the dictionaries reflect the underlying OS's conception of environment variables means users of os.environ and WSGI will have to be able to cope with both bytes and unicode, which would also be a big annoyance.

In summary: urgh, this is all messy and 'orrible.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/
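The loss described for Windows CGI can be reproduced in miniature with Python 3. The sketch below assumes cp1252 as the system codepage and `errors='replace'` as the lossy conversion; the actual behaviour of IIS and the C library may differ, and the sample path is made up.

```python
submitted = b'/caf\xc3\xa9/IFN-\xce\xb3'   # UTF-8 bytes sent by a client

as_unicode = submitted.decode('utf-8')      # the server's UTF-8 guess

# Re-encode to the assumed system codepage, replacing unmappable characters.
seen_by_app = as_unicode.encode('cp1252', errors='replace')
```

The accented "e" survives as a different byte sequence than the one submitted, and the Greek gamma, which cp1252 cannot represent, collapses to "?" and cannot be recovered.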
Re: [Web-SIG] WSGI, Python 3 and Unicode
Adam Atlas [EMAIL PROTECTED] wrote:
> I'd say it would be best to only accept `bytes` objects

+1. HTTP is inherently byte-based. Any translation between bytes and unicode characters should be done at a higher level, by whatever web framework is living above WSGI.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 7, 2007, at 2:55 PM, Phillip J. Eby wrote:
> * When running under Python 3, servers MUST provide CGI HTTP variables as strings, decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI variables that should NOT be strings?)

A WSGI gateway should *not* decode headers using RFC 2047. It actually *cannot*, without knowing the structure of that particular header, because only TEXT tokens are encoded that way.

In addition, I know of nobody who actually implements RFC 2047 decoding of http header values...nothing really uses it. (of course I don't know of all implementations out there.)

On Dec 7, 2007, at 3:24 PM, Ian Bicking wrote:
> I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not latin1. That is, after you urldecode the values (as WSGI asks you to do) proper conversion to text is to decode it as UTF8.

Surely not! URLs aren't always utf-8 encoded, only often.

James
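James's objection can be made concrete with a Python 3 sketch in which a path percent-encoded from latin-1 fails strict UTF-8 decoding. The helper `decode_path` and its try-UTF-8-then-latin-1 fallback are assumptions introduced for illustration; nobody in the thread proposes this policy, and such a guess can still be wrong.

```python
from urllib.parse import unquote_to_bytes


def decode_path(raw_path):
    """Percent-decode a path, then guess its text encoding.

    Hypothetical helper: try UTF-8 first and fall back to latin-1,
    which never fails because every byte maps to a code point.
    """
    data = unquote_to_bytes(raw_path)
    try:
        return data.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        return data.decode('latin-1'), 'latin-1'


utf8_result = decode_path('/caf%C3%A9')   # client encoded the path as UTF-8
latin1_result = decode_path('/caf%E9')    # client encoded the path as latin-1
```

Both calls yield the same text, but only because the fallback happened to match the sender's encoding; the gateway has no reliable way to know which encoding was intended.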