Hello everybody,

I recently started looking at supporting Python 3 in one of my libraries (Werkzeug), mainly because the MoinMoin project, which uses the library, is considering a move to Python 3. Right now Werkzeug treats HTTP as Unicode-aware in the sense that everything that carries text data is decoded from and encoded into a known encoding.
This is partially against the specification and not entirely correct, but it works best with modern browsers and is also what Django and Paste are doing. Basically, the incoming request data is decoded with .decode(encoding) (usually utf-8) before it is passed to the user code, and unicode data is encoded back into the same encoding before it is sent back to the server.

Now why is the current behavior of Python 3 a problem here? The encode/decode hack from above is obviously a solution for these kinds of applications, albeit not a good one. Interfaces like mod_wsgi already have the data as a bytestring and would decode it from latin1 just so that the application can encode it back and decode it as utf-8. Not only is this slow, it also means that the code does not survive a run through 2to3. Now you could argue that the libraries were wrong in the first place and should support unicode strings that were decoded from latin1, but it seems that very few libraries support that.

Now which strings carry data that could contain non-ASCII characters from a source with an unknown encoding? Right now these are the following:

* PATH_INFO
* SCRIPT_NAME
* QUERY_STRING
* CONTENT_TYPE
* HTTP_*

Also all headers that carry non-integer values (like HTTP_CONTENT_TYPE and CONTENT_TYPE).

Now it's true that the headers should not contain non-latin1 values, but reality shows that they do. Cookies are transmitted as headers as well, and no browser complains if you put utf-8 encoded data into them. It may be the case that to the browser this looks like latin1, but in the end the application decodes it from utf-8 and is happy.

Data sent from the application can continue to work as it does currently. However, Django, Werkzeug, Paste and many others that support unicode output will just check whether the output is unicode and, if that's the case, encode it to the desired encoding. Also, people abuse middlewares a lot, and those deal with incoming and outgoing data as well.
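As a rough illustration of the round trip described above, here is roughly what an application or middleware would have to do if the server handed it strings that were decoded from latin1. This is a sketch only: the names TEXT_KEYS, redecode and fix_environ are made up for this example and are not part of Werkzeug, mod_wsgi or wsgiref, and utf-8 is assumed as the application's encoding.

```python
# Hypothetical sketch of the latin1 -> bytes -> utf-8 "dance".
# None of these names exist in Werkzeug, mod_wsgi or wsgiref.

# Environ keys from the list above that may carry non-ASCII data.
TEXT_KEYS = ("PATH_INFO", "SCRIPT_NAME", "QUERY_STRING", "CONTENT_TYPE")


def redecode(value, encoding="utf-8"):
    """Undo the server's latin1 decode, then decode with the real encoding."""
    # latin1 maps every byte to exactly one code point, so encoding back
    # to latin1 recovers the original bytes without loss.
    return value.encode("latin1").decode(encoding, "replace")


def fix_environ(environ, encoding="utf-8"):
    """Return a copy of environ with the text-carrying keys re-decoded."""
    fixed = dict(environ)
    for key, value in environ.items():
        is_text_key = key in TEXT_KEYS or key.startswith("HTTP_")
        if is_text_key and isinstance(value, str):
            fixed[key] = redecode(value, encoding)
    return fixed


# Simulate a server that received utf-8 bytes and decoded them as latin1.
raw = "/caf\u00e9".encode("utf-8")
environ = {"PATH_INFO": raw.decode("latin1"), "CONTENT_LENGTH": "0"}
fixed = fix_environ(environ)
```

Every layer that wants properly decoded text has to repeat this step, which is the overhead that passing bytestrings for these keys would avoid.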
One can expect these middlewares to work on known encodings as well, so those would do the encode/decode dance too. If anyone knows the encoding of the environ, it is the webserver. Apparently there are issues getting the encoding of the environ, but those won't go away when moving that to the web application.

Because of that I propose that Python 3.1 ship a version of wsgiref that uses bytestrings for the headers in question, and that a section on Python 3 compatibility based on that be added to PEP 333. I volunteer to write the new section on Python 3 in PEP 333 :-)

Regards,
Armin