Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 9, 2007 7:56 PM, Graham Dumpleton wrote:
> On 09/12/2007, Guido van Rossum wrote:
> > On Dec 8, 2007 12:37 AM, Graham Dumpleton wrote:
> > > On 08/12/2007, Phillip J. Eby wrote:
> > > > * When running under Python 3, servers MUST provide a text stream for wsgi.errors
> > >
> > > In Python 3, what happens if user code attempts to output to a text stream a byte string? Ie., what would be displayed?
> >
> > Nothing. You get a TypeError.
>
> Hmmm, this in itself could be quite a pain for existing code where people have added debug code to print out details from request headers (if now to be passed as bytes), or part of the request content.

Sorry, I was just talking about the write() method on a text stream. The print() function in 3.0 will print the repr() of the bytes. Example:

    Python 3.0a2 (py3k, Dec 10 2007, 09:38:42)
    [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> a = b"xyz"
    >>> print(a)
    b'xyz'
    >>> b = b"abc\377def"
    >>> print(b)
    b'abc\xffdef'

(Note that this works because print() always calls str() on the argument and bytes.__str__ is defined to be the same as bytes.__repr__.)

> What is the suggested way of best dumping out bytes for debugging purposes so one does not have to worry about encoding issues, just use repr()?

Just use print().

> > > Also, if wsgi.errors is a text stream, presume that if a WSGI adapter has to internally map this to a C char* like API for logging that it would need to apply standard Python encoding to yield usable char* string for output.
> >
> > The encoding can/must be specified per text stream.
>
> But what should the encoding associated with the wsgi.errors stream be?

Depends on the platform and your requirements.

> If code which outputs text to wsgi.errors can use any valid Unicode character, if one sets it to US-ASCII encoding, then chance that logging output will fail because of characters not being valid in that character set. If one instead uses UTF-8, then potentially have issues where that byte string coming out other end of text stream is passed to C API functions. Issues might arise here where C API not expecting variable width character encoding.
>
> I'll freely admit I am not across all this Unicode encode/decode stuff as I don't generally have to deal with foreign languages, but seems to be a few missing details in this area which need to be filled out for a modified WSGI specification.

The goal of this part of Py3k is to make it more obvious when you haven't thought through your encoding issues enough by failing as soon as (encoded) bytes meet (decoded) characters. Of course, you can still run into delayed trouble by using an inappropriate encoding, which only shows up when there is an actual encoding or decoding error; but at least you will have carefully distinguished between encoded and decoded text throughout your program, so the fix is now to change the encoding rather than having to restructure your code to properly separate encoded and decoded text.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Web-SIG mailing list Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
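The behaviour described in this message (write() of bytes to a text stream fails, print() shows the repr, and a narrow stream encoding can reject log text) can be sketched for a current Python 3 roughly as follows. This is only an illustration: the in-memory `io.TextIOWrapper` is a stand-in for wsgi.errors, and all helper names are hypothetical, not part of any WSGI server.

```python
import io


def make_text_stream(encoding):
    """Return (text_stream, raw_buffer); an assumed stand-in for wsgi.errors."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n",
                              write_through=True)
    return stream, raw


def write_bytes_fails(stream):
    """True if writing bytes directly to the text stream raises TypeError."""
    try:
        stream.write(b"xyz")
    except TypeError:
        return True
    return False


def logged_bytes(stream, raw, data):
    """print() a value to the stream and return what reached the byte buffer."""
    print(data, file=stream)
    stream.flush()
    return raw.getvalue()


def ascii_stream_rejects(text):
    """True if an ASCII-encoded text stream cannot accept the given text."""
    stream, _ = make_text_stream("ascii")
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return True
    return False
```

With a UTF-8 stream, `write_bytes_fails` reports the TypeError while `logged_bytes` records the repr of the bytes object, which is why print() is a safe way to dump bytes for debugging.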
Re: [Web-SIG] WSGI, Python 3 and Unicode
On 08/12/2007, Phillip J. Eby wrote:
> * When running under Python 3, servers MUST provide a text stream for wsgi.errors

In Python 3, what happens if user code attempts to output to a text stream a byte string? Ie., what would be displayed?

Also, if wsgi.errors is a text stream, presume that if a WSGI adapter has to internally map this to a C char* like API for logging that it would need to apply standard Python encoding to yield usable char* string for output.

Graham
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 8, 2007 12:37 AM, Graham Dumpleton wrote:
> On 08/12/2007, Phillip J. Eby wrote:
> > * When running under Python 3, servers MUST provide a text stream for wsgi.errors
>
> In Python 3, what happens if user code attempts to output to a text stream a byte string? Ie., what would be displayed?

Nothing. You get a TypeError.

> Also, if wsgi.errors is a text stream, presume that if a WSGI adapter has to internally map this to a C char* like API for logging that it would need to apply standard Python encoding to yield usable char* string for output.

The encoding can/must be specified per text stream.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Web-SIG] WSGI, Python 3 and Unicode
[Alan]
> The restriction to iso-8859-1 is really a distraction; iso-8859-1 is used simply as an identity encoding that also enforces that all bytes in the string have a value from 0x00 to 0xff, so that they are suitable for byte-oriented IO. So, in output terms at least, WSGI *is* a byte-oriented protocol. The problem is the python-the-language didn't have support for bytes at the time WSGI was designed.

[Thomas]
> If you're talking about the output stream, then yes, it's all about bytes (or should be).

Indeed, I was only talking about output, specifically the response body.

> But at the status and headers level, HTTP/1.1 is fundamentally ISO-8859-1-encoded.

Agreed. That is why the WSGI spec also states

    Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.

So in order to use non-ISO-8859-1 characters in response status strings or headers, you must use RFC 2047. As confirmed by the links you posted, this is a HTTP restriction, not a WSGI restriction.

Regards,

Alan.
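The "identity encoding" property mentioned above can be checked directly in Python 3: latin-1 maps each byte value 0x00 to 0xff onto the code point with the same number, so any byte sequence survives a decode/encode round trip. A small sketch, with hypothetical helper names:

```python
def bytes_to_wsgi_str(data):
    """Carry raw bytes in a str, one code point per byte (latin-1 is 1:1)."""
    return data.decode("latin-1")


def wsgi_str_to_bytes(text):
    """Recover the bytes; raises UnicodeEncodeError above U+00FF."""
    return text.encode("latin-1")


# Every possible byte value survives the round trip unchanged.
ALL_BYTES = bytes(range(256))
ROUND_TRIP_OK = wsgi_str_to_bytes(bytes_to_wsgi_str(ALL_BYTES)) == ALL_BYTES
```

A string containing any code point above U+00FF fails in `wsgi_str_to_bytes`, which is the enforcement side of the restriction.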
Re: [Web-SIG] WSGI, Python 3 and Unicode
[Phillip]
> WSGI already copes, actually. Note that Jython and IronPython have this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues

[James]
> It would seem very odd, however, for WSGI/python3 to use strings-restricted-to-0xFF for network I/O while everywhere else in python3 is going to use bytes for the same purpose.

I think it's worth pointing out the reason for the current restriction to iso-8859-1 is *because* python did not have a bytes type at the time the WSGI spec was drawn up. IIRC, the bytes type had not yet even been proposed for Py3K. Cpython effectively held all byte sequences as strings, a paradigm which is (still) followed by jython (not sure about ironpython).

The restriction to iso-8859-1 is really a distraction; iso-8859-1 is used simply as an identity encoding that also enforces that all bytes in the string have a value from 0x00 to 0xff, so that they are suitable for byte-oriented IO. So, in output terms at least, WSGI *is* a byte-oriented protocol. The problem is the python-the-language didn't have support for bytes at the time WSGI was designed.

[James]
> You'd have to modify your app to call write(unicodetext.encode('utf-8').decode('latin-1')) or so

Did you mean: write(unicodetext.encode('utf-8').encode('latin-1'))?

Either way, the second encode is not required; write(unicodetext.encode('utf-8')) is sufficient, since it will generate a byte-sequence(string) which will (actually should: see (*) note below) pass the following test.

    try:
        wsgi_response_data.encode('iso-8859-1')
    except UnicodeError:
        # Illegal WSGI response data!
        raise

On a side note, it's worth noting that Philip Jenvey's excellent rework of the jython IO subsystem to use java.nio is fundamentally byte oriented.

http://www.nabble.com/fileno-support-is-not-in-jython.-Reason--t4750734.html
http://fisheye3.cenqua.com/browse/jython/trunk/jython/src/org/python/core/io

Because it is based on the new IO design for Python 3K, as described in PEP 3116

http://www.python.org/dev/peps/pep-3116/

Regards,

Alan.

[*] Although I notice that cpython 2.5, for a reason I don't fully understand, fails this particular encoding sequence. (Maybe it's to do with the possibility that the result of an encode operation is no longer an encodable string?)

    Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
    >>> response.encode('utf-8').encode('latin-1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 22: ordinal not in range(128)

Meaning that to enforce the WSGI iso-8859-1 convention on cpython 2.5, you would have to carry out this rigmarole

    >>> response.encode('utf-8').decode('latin-1').encode('latin-1')
    'interferon-gamma (IFN-\xce\xb3) responses in cattle'

Perhaps this behaviour is an artifact of the cpython implementation? Whereas jython passes it just fine (and correctly, IMHO)

    Jython 2.2.1 on java1.4.2_15
    Type "copyright", "credits" or "license" for more information.
    >>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
    >>> response.encode('utf-8')
    'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
    >>> response.encode('utf-8').encode('latin-1')
    'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
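For comparison, here is how the same check might look in Python 3, where str is Unicode and bytes has no encode() method at all, so the chained call from the CPython 2.5 session cannot even be written. This is a sketch under that assumption; the function names are hypothetical.

```python
def utf8_as_wsgi_str(text):
    """Encode text as UTF-8, then carry those bytes in a latin-1-safe str."""
    return text.encode("utf-8").decode("latin-1")


def is_legal_wsgi_str(value):
    """The test from the message: the string must be ISO-8859-1 encodable."""
    try:
        value.encode("iso-8859-1")
    except UnicodeError:
        return False
    return True


def wire_bytes(value):
    """What a gateway would put on the wire for a legal WSGI string."""
    return value.encode("iso-8859-1")
```

The original text containing U+03B3 fails `is_legal_wsgi_str`, while the result of `utf8_as_wsgi_str` passes it and yields the UTF-8 bytes again through `wire_bytes`.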
Re: [Web-SIG] WSGI, Python 3 and Unicode
Phillip J. Eby wrote:
> So here are my recommendations so far for the addendum to WSGI *1.0* for Python 3.0 (I expect we can be more strict for WSGI 2.0):
>
> * When running under Python 3, applications SHOULD produce bytes output and headers
>
> * When running under Python 3, servers and gateways MUST accept strings as application output or headers, under the existing rules (i.e., s.encode('latin-1') must convert the string to bytes without an exception)
>
> * When running under Python 3, servers MUST provide CGI HTTP variables as strings, decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI variables that should NOT be strings?)

I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not latin1. That is, after you urldecode the values (as WSGI asks you to do) proper conversion to text is to decode it as UTF8.

I'm a bit confused on how HTTP_COOKIE gets encoded. And QUERY_STRING also confuses me.

Is this all compatible with os.environ in py3k? I don't care that much if it does, but as the starting point for CGI it would be interesting if it stays in sync.

--
Ian Bicking : http://blog.ianbicking.org
Re: [Web-SIG] WSGI, Python 3 and Unicode
So here are my recommendations so far for the addendum to WSGI *1.0* for Python 3.0 (I expect we can be more strict for WSGI 2.0):

* When running under Python 3, applications SHOULD produce bytes output and headers

* When running under Python 3, servers and gateways MUST accept strings as application output or headers, under the existing rules (i.e., s.encode('latin-1') must convert the string to bytes without an exception)

* When running under Python 3, servers MUST provide CGI HTTP variables as strings, decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI variables that should NOT be strings?)

* When running under Python 3, servers MUST make wsgi.input a binary (byte) stream

* When running under Python 3, servers MUST provide a text stream for wsgi.errors

These rules are intended to simplify the porting of existing code. Notice, for example, that these rules allow middleware to pass strings through unchanged, since they are not required to produce bytes output or headers.

Unfortunately, wsgi.input can't be coded around, but for most frameworks this should be a single point of pain. In fact, if the 'cgi' stdlib module is made compatible with bytes, only the rare framework that rolls its own multipart parser or otherwise directly manipulates put/post data will be affected. Code that just takes the input and writes it to a file won't be bothered, either.

Comments or questions?
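The first two rules above (applications should produce bytes, servers must still accept latin-1-encodable strings) could be handled on the server side along these lines. This is a minimal sketch of the proposal as worded in this message only, not of any finalised specification; `to_wire_bytes` and `normalize_response` are hypothetical names.

```python
def to_wire_bytes(value):
    """Accept bytes as-is; accept str only if latin-1 can encode it."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        # UnicodeEncodeError here signals an illegal WSGI string.
        return value.encode("latin-1")
    raise TypeError("expected bytes or str, got %r" % type(value).__name__)


def normalize_response(status, headers, body_iterable):
    """Convert status, headers and body chunks to bytes for output."""
    status_b = to_wire_bytes(status)
    headers_b = [(to_wire_bytes(name), to_wire_bytes(value))
                 for name, value in headers]
    body = b"".join(to_wire_bytes(chunk) for chunk in body_iterable)
    return status_b, headers_b, body
```

Because strings pass through the same function as bytes, middleware that forwards str values unchanged keeps working, which is the porting benefit the message describes.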
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 7, 2007, at 5:46 PM, Andrew Clover wrote:
> OTOH making the dictionaries reflect the underlying OS's conception of environment variables means users of os.environ and WSGI will have to be able to cope with both bytes and unicode, which would also be a big annoyance.
>
> In summary: urgh, this is all messy and 'orrible.

I suppose this is more a question for python-dev, but, it'd be really nice if Python on Windows made it look like the windows system encoding was always UTF-8. That is, bytestrings used for open/os.environ/argv/etc. are always encoded/decoded in utf-8, not the broken-platform-encoding. Then the same code would work just as well on unix as it does on windows.

Actually, I bet I could implement that today, just by wrapping some stuff... hmmm...

James
Re: [Web-SIG] WSGI, Python 3 and Unicode
James Y Knight wrote:
> In addition, I know of nobody who actually implements RFC 2047 decoding of http header values...nothing really uses it. (of course I don't know of all implementations out there.)

Certainly no browser supports it, which makes the point moot for WSGI. Most browsers, when quoting a header parameter, simply encode using the previous page's charset and put quotes around it... even if the parameter has a quote or control codes in it.

Ian wrote:
> Is this all compatible with os.environ in py3k?

In 3.0a2 os.environ has Unicode strings for both keys and values. This is correct for Windows where environment variables are explicitly Unicode, but questionable (IMO) for Unix where they're really bytes that may or may not represent decodeable Unicode strings.

> SCRIPT_NAME/PATH_INFO

This already causes problems in Windows CGI applications! Because these are passed in environment variables, IIS* has to decode the submitted bytes to Unicode first. It seems always to choose UTF-8 for this job, which I suppose is the least bad guess, but hardly infallible.

(* - haven't tested this with Apache for Windows yet.)

In Python 2.x, os.environ being byte strings, Python/the C library then has to encode them back to bytes, which I believe ends up using the system codepage. Since the system codepage is never UTF-8 on Windows this means not only that the bytes read back from eg. PATH_INFO are not the same as the original bytes submitted to the web server, but that if there are characters outside the system codepage submitted, they'll be unrecoverable.

If os.environ remains Unicode in Unix and WSGI follows it (as it must if CGI-invoked WSGI is to continue working smoothly), webapps that try to allow for non-ASCII characters in URLs are likely to get some nasty deployment problems that depend on the system encoding setting, something that will be particularly troublesome for end-users to debug and fix.

OTOH making the dictionaries reflect the underlying OS's conception of environment variables means users of os.environ and WSGI will have to be able to cope with both bytes and unicode, which would also be a big annoyance.

In summary: urgh, this is all messy and 'orrible.

--
And Clover http://www.doxdesk.com/
Re: [Web-SIG] WSGI, Python 3 and Unicode
Adam Atlas wrote:
> I'd say it would be best to only accept `bytes` objects

+1. HTTP is inherently byte-based. Any translation between bytes and unicode characters should be done at a higher level, by whatever web framework is living above WSGI.

--
And Clover http://www.doxdesk.com/
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 7, 2007, at 2:55 PM, Phillip J. Eby wrote:
> * When running under Python 3, servers MUST provide CGI HTTP variables as strings, decoded from the headers using HTTP standard encodings (i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI variables that should NOT be strings?)

A WSGI gateway should *not* decode headers using RFC 2047. It actually *cannot*, without knowing the structure of that particular header, because only TEXT tokens are encoded that way.

In addition, I know of nobody who actually implements RFC 2047 decoding of http header values...nothing really uses it. (of course I don't know of all implementations out there.)

On Dec 7, 2007, at 3:24 PM, Ian Bicking wrote:
> I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not latin1. That is, after you urldecode the values (as WSGI asks you to do) proper conversion to text is to decode it as UTF8.

Surely not! URLs aren't always utf-8 encoded, only often.

James
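The disagreement over PATH_INFO can be made concrete with a small Python 3 sketch that percent-decodes a path to bytes and then tries candidate encodings in order. The ordering (UTF-8 first, latin-1 as a fallback that never fails) is an assumed policy for illustration, not something the thread settled on, and `decode_path` is a hypothetical helper.

```python
from urllib.parse import unquote_to_bytes


def decode_path(raw_path, encodings=("utf-8", "latin-1")):
    """Percent-decode a URL path, then guess its text encoding.

    Returns (text, encoding_used). A path that happens to be valid UTF-8
    is decoded as UTF-8 even if the client meant something else, which is
    the "only often" caveat raised in the message above.
    """
    data = unquote_to_bytes(raw_path)
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("path is not decodable with any candidate encoding")
```

Here `/caf%C3%A9` decodes as UTF-8, while `/caf%E9` is not valid UTF-8 and falls through to latin-1; both produce the same text from different bytes.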
Re: [Web-SIG] WSGI, Python 3 and Unicode
At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
> Has anyone had any thoughts about how WSGI is going to be made to work with Python 3?
>
> From what I understand about changes in Python 3, the main issue seems to be the removal of string type in its current form.
>
> This is an issue as WSGI specification currently states that status, header names/values and the items returned by the iterable must all be string instances. This is done to ensure that the application has done any conversions from Unicode, where knowledge about encoding would be known, before being passed to WSGI adapter.
>
> In Python 3 the default for string type objects will effectively be Unicode. Is WSGI going to be made to somehow cope with that, or will application instead be required to return byte string objects instead?

WSGI already copes, actually. Note that Jython and IronPython have this issue today, and see:

http://www.python.org/dev/peps/pep-0333/#unicode-issues

    On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all strings referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007 4:15 PM, Phillip J. Eby wrote:
> At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
> > Has anyone had any thoughts about how WSGI is going to be made to work with Python 3?
> >
> > From what I understand about changes in Python 3, the main issue seems to be the removal of string type in its current form.
> >
> > This is an issue as WSGI specification currently states that status, header names/values and the items returned by the iterable must all be string instances. This is done to ensure that the application has done any conversions from Unicode, where knowledge about encoding would be known, before being passed to WSGI adapter.
> >
> > In Python 3 the default for string type objects will effectively be Unicode. Is WSGI going to be made to somehow cope with that, or will application instead be required to return byte string objects instead?
>
> WSGI already copes, actually. Note that Jython and IronPython have this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues
>
>     On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all strings referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.

That may work for IronPython/Jython, where encoded data is represented by the str type, but it won't be sufficient for Py3k, where encoded data is represented using the bytes type. IOW, in IronPython/Jython, u"\u1234".encode('utf-8') returns a str instance: '\xe1\x88\xb4'; but in Py3k, it returns a bytes instance: b'\xe1\x88\xb4'.

The issue applies to input as well as output -- data read from a socket is also represented as bytes, unless you're using makefile() with a text mode and an encoding.

You might want to look at how the unittests for wsgiref manage to pass in Py3k though. ;-)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
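The distinction drawn above is easy to confirm in a current Python 3: encoding produces a bytes object, and bytes and str do not mix implicitly. A short sketch with hypothetical function names:

```python
def encode_sample():
    """Encode one non-ASCII character; in Python 3 the result is bytes."""
    return "\u1234".encode("utf-8")


def mixing_fails(text, data):
    """True if concatenating the two values raises TypeError."""
    try:
        text + data
    except TypeError:
        return True
    return False
```

Concatenating a str with the result of `encode_sample()` raises TypeError, so a str-only rule cannot silently absorb encoded data the way it can on platforms where encoded data is itself a str.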
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007, at 7:15 PM, Phillip J. Eby wrote:
> WSGI already copes, actually. Note that Jython and IronPython have this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues
>
>     On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all strings referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.

It would seem very odd, however, for WSGI/python3 to use strings-restricted-to-0xFF for network I/O while everywhere else in python3 is going to use bytes for the same purpose. You'd have to modify your app to call write(unicodetext.encode('utf-8').decode('latin-1')) or so

James
Re: [Web-SIG] WSGI, Python 3 and Unicode
At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
> On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> > In Python 3 the default for string type objects will effectively be Unicode. Is WSGI going to be made to somehow cope with that, or will application instead be required to return byte string objects instead?
>
> I'd say it would be best to only accept `bytes` objects; anything else would require some guesswork. Maybe, at most, it could try to encode returned Unicode objects as ISO-8859-1, and have it be an error if that's not possible.

Actually, I'd prefer to look at it the other way around: a Python 3 WSGI server or middleware *may* accept bytes objects instead of str. This is relatively easy for the response side of things, but the request side is rather more difficult, since wsgi.input may need to be binary rather than text mode. (I think we can reasonably assume that wsgi.errors is a text mode stream, and should support a reasonable encoding.)
Re: [Web-SIG] WSGI, Python 3 and Unicode
On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> In Python 3 the default for string type objects will effectively be Unicode. Is WSGI going to be made to somehow cope with that, or will application instead be required to return byte string objects instead?

I'd say it would be best to only accept `bytes` objects; anything else would require some guesswork. Maybe, at most, it could try to encode returned Unicode objects as ISO-8859-1, and have it be an error if that's not possible.

I was going to say that the gateway could accept Unicode objects if the user-agent sent a comprehensible Accept-Charset header, and thereby encode application output to the client's preferred character set on the fly (or to ISO-8859-1 if no Accept-Charset is provided), but that would complicate things for people writing gateways (and would be too implicit). It could be useful, but it would make more sense as a simple decorator for (almost-)WSGI applications. Perhaps it could go in wsgiref.
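The decorator idea floated above might look roughly like this in Python 3. Everything here is a hypothetical sketch: `pick_charset` and `encode_for_client` are invented names, the Accept-Charset parsing is deliberately simplistic, and nothing like this is claimed to exist in wsgiref.

```python
import codecs


def pick_charset(accept_charset, default="iso-8859-1"):
    """Pick the highest-q charset Python can encode to, else the default."""
    best, best_q = default, 0.0
    for item in accept_charset.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip()
        if not name or name == "*":
            continue
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                continue
        try:
            codecs.lookup(name)  # skip charsets unknown to Python
        except LookupError:
            continue
        if q > best_q:
            best, best_q = name, q
    return best


def encode_for_client(app, default_charset="iso-8859-1"):
    """Wrap an almost-WSGI app that may yield str chunks as well as bytes."""
    def wrapper(environ, start_response):
        charset = pick_charset(environ.get("HTTP_ACCEPT_CHARSET", ""),
                               default_charset)
        for chunk in app(environ, start_response):
            # Bytes pass through untouched; str is encoded for the client.
            yield chunk if isinstance(chunk, bytes) else chunk.encode(charset)
    return wrapper
```

The sketch leaves two things open: it does not rewrite the Content-Type header to advertise the chosen charset, and a chunk that the chosen charset cannot represent still raises UnicodeEncodeError. Both would need a policy decision in a real wrapper.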
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007 6:15 PM, Phillip J. Eby wrote:
> WSGI already copes, actually. Note that Jython and IronPython have this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues

I'm glad you brought that up, because it's been bugging me lately.

That section is somewhat ambiguous as-is, because in one sentence applications are permitted to return strings encoded in a charset other than ISO-8859-1, but in another they are unequivocally forbidden to do so (with the "must not" in bold, even).

And that's problematic not only because of the ambiguity, but because the increasing popularity of AJAX and web-based APIs is making it much more common for WSGI applications to generate responses of types which do not default to ISO-8859-1 -- e.g., XML and JSON, both of which default to UTF-8.

Depending on how draconian one wishes to be when reading the relevant section of WSGI, it's possible to conclude that XML and JSON must always be transcoded/escaped to ISO-8859-1 -- with all the headaches that entails -- before being passed to a WSGI-compliant piece of software. And the slightly less strict reading of the spec -- that such gymnastics are required only when the string type of the Python implementation is Unicode-based -- will grow increasingly troublesome as/when Py3K enters production use.

So as long as we're talking about this, could the proscriptions with respect to encoding perhaps be revisited and (hopefully) clarified/revised?

--
Bureaucrat Conrad, you are technically correct -- the best kind of correct.
Re: [Web-SIG] WSGI, Python 3 and Unicode
Phillip J. Eby wrote:
> At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
> > On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> > > In Python 3 the default for string type objects will effectively be Unicode. Is WSGI going to be made to somehow cope with that, or will application instead be required to return byte string objects instead?
> >
> > I'd say it would be best to only accept `bytes` objects; anything else would require some guesswork. Maybe, at most, it could try to encode returned Unicode objects as ISO-8859-1, and have it be an error if that's not possible.
>
> Actually, I'd prefer to look at it the other way around: a Python 3 WSGI server or middleware *may* accept bytes objects instead of str. This is relatively easy for the response side of things, but the request side is rather more difficult, since wsgi.input may need to be binary rather than text mode. (I think we can reasonably assume that wsgi.errors is a text mode stream, and should support a reasonable encoding.)

wsgi.input definitely seems like it should be bytes to me. Unless we want to put the encoding process into the server. Not entirely infeasible, but a bit of a strain. And the request body might very well be binary, e.g., on a PUT.

The CGI keys in the environment don't feel at all like bytes to me, but then they aren't unicode either. They can be unicode, again given a bit of work on the server side. Though unfortunately browsers are very poor at indicating their encoding for requests, and it ends up being policy and configuration as much as anything that determines the encoding of stuff like wsgi.input. I believe all request paths are UTF8 (?), but I'm not sure about QUERY_STRING. I'm a little fuzzy on some of the details there.

The actual response body should also be bytes. Unless again we want to introduce upstream encoding.

This does make everything feel more complicated.

--
Ian Bicking : http://blog.ianbicking.org
Re: [Web-SIG] WSGI, Python 3 and Unicode
On Dec 6, 2007 8:00 PM, Ian Bicking wrote:
> Phillip J. Eby wrote:
> > At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
> > > On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> > > > In Python 3 the default for string type objects will effectively be Unicode. Is WSGI going to be made to somehow cope with that, or will application instead be required to return byte string objects instead?
> > >
> > > I'd say it would be best to only accept `bytes` objects; anything else would require some guesswork. Maybe, at most, it could try to encode returned Unicode objects as ISO-8859-1, and have it be an error if that's not possible.
> >
> > Actually, I'd prefer to look at it the other way around: a Python 3 WSGI server or middleware *may* accept bytes objects instead of str. This is relatively easy for the response side of things, but the request side is rather more difficult, since wsgi.input may need to be binary rather than text mode. (I think we can reasonably assume that wsgi.errors is a text mode stream, and should support a reasonable encoding.)
>
> wsgi.input definitely seems like it should be bytes to me. Unless we want to put the encoding process into the server. Not entirely infeasible, but a bit of a strain. And the request body might very well be binary, e.g., on a PUT.
>
> The CGI keys in the environment don't feel at all like bytes to me, but then they aren't unicode either. They can be unicode, again given a bit of work on the server side. Though unfortunately browsers are very poor at indicating their encoding for requests, and it ends up being policy and configuration as much as anything that determines the encoding of stuff like wsgi.input. I believe all request paths are UTF8 (?), but I'm not sure about QUERY_STRING. I'm a little fuzzy on some of the details there.
>
> The actual response body should also be bytes. Unless again we want to introduce upstream encoding.
>
> This does make everything feel more complicated.

It's the same level of complexity you run into as soon as you want to handle Unicode with WSGI in 2.x though, as it is caused by something outside our control (HTTP and browsers).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)