2009/5/5 Armin Ronacher <armin.ronac...@active-4.com>:
> Hello everybody,
>
> I just recently started looking at supporting Python 3 with one of my
> libraries (Werkzeug), mainly because the MoinMoin project, which uses the
> library in question, is considering using it. Right now what Werkzeug does
> is consider HTTP being Unicode aware in the sense that everything that
> carries text data is encoded and decoded into a known encoding.
>
> This is partially against the specification and not entirely correct, but it
> works the best on modern browsers and is also what Django and Paste are doing.
>
> It's basically that the incoming request data is .decode(encoding)d (usually
> utf-8) before being passed to the user code and unicode data is encoded back
> into the same encoding before it's sent to the server.
>
> Now why is the current behavior of Python 3 a problem here? The encode,
> decode hack from above is obviously a solution for these kinds of
> applications, albeit not a good one. Interfaces like mod_wsgi already have
> the data as bytestring, would decode it from latin1 just so that the
> application can encode it back and decode as utf-8. Not only is this slow
> but it also means that the code does not survive a run through 2to3.
>
> Now you could argue that the libraries were wrong in the first place and
> should support unicode strings that were encoded from latin1 and decoded,
> but it seems like very few libraries support that.
>
> Now which strings carry data that could contain non-ascii characters from a
> source with an unknown encoding? Right now these are the following:
>
> * PATH_INFO
> * SCRIPT_NAME
> * QUERY_STRING
> * CONTENT_TYPE
> * HTTP_*

Depending on the underlying web server that the WSGI adapter runs on, there
might also be:

  REQUEST_URI
  PATH_TRANSLATED (??)

Yes, I know these aren't required for WSGI, except to the extent that the
WSGI specification says:

  "A server or gateway should attempt to provide as many other CGI
  variables as are applicable."

I would have to check CGI, but there may be more.

The way I thus read this is that keys are always strings and values will be
strings, except for a specific list of entries where values would be bytes.
Also, I presume that wsgi.url_scheme will have a string value.

Where things get difficult for me with Apache is where users can use SetEnv
or mod_rewrite to define additional key/values to be added to the WSGI
environment. For example:

  SetEnv trac.env_path /some/path

I can't see that I have any choice but to pass such settings through as
strings, else it more than likely would cause problems for applications. The
problem is that it isn't clear what encoding the Apache configuration can be
in. At the moment latin-1 is assumed.

Things get more complicated, though, when mod_rewrite is used, as the values
could be derived from components of the URL which are being treated as bytes
above. For example:

  RewriteCond %{THE_REQUEST} ^\ *([A-Z]+)\ *(.*)\ *(HTTP/.*)$
  RewriteRule . - [E=UNPARSED_URI:%1]

So, this is creating a new UNPARSED_URI value which is the original URL as it
appeared in the request line. Strictly speaking, I can't know that this
should be bytes. As such, I think all I can do is always pass through
additional values as strings, interpreted as latin-1. If some special-case
handling is required, it would be up to the WSGI application. I am not too
keen on special configuration directives to allow the encoding, and/or
whether bytes are used, to be specified for each possible variable being set.

Anyway, this is special-case stuff and, if it is being done, is likely going
to be specific to Apache/mod_wsgi.
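
A minimal sketch of that reading, in Python 3, may make the split concrete. It assumes the adapter receives raw (key, value) byte pairs from the web server; the names BYTES_KEYS, is_bytes_key, build_environ and recover_bytes are made up for illustration and are not part of mod_wsgi or of any WSGI specification. The set of bytes-valued keys is only the list discussed above, not a decided one.

```python
# Hypothetical sketch: these names are illustrative, not a mod_wsgi or WSGI API.

# Keys whose values would stay as bytes under the reading described above.
# This list is an assumption taken from the discussion, not a settled spec.
BYTES_KEYS = {
    "PATH_INFO",
    "SCRIPT_NAME",
    "QUERY_STRING",
    "CONTENT_TYPE",
    "REQUEST_URI",
    "PATH_TRANSLATED",
}


def is_bytes_key(key):
    """Return True if the value for this key would be passed as bytes."""
    return key in BYTES_KEYS or key.startswith("HTTP_")


def build_environ(raw_items):
    """Build an environ dict from (bytes key, bytes value) pairs.

    Keys are always str. Values for the listed keys stay as bytes; anything
    else (for example SetEnv or mod_rewrite additions) becomes a str decoded
    as latin-1.
    """
    environ = {}
    for raw_key, raw_value in raw_items:
        key = raw_key.decode("latin-1")
        if is_bytes_key(key):
            environ[key] = raw_value
        else:
            environ[key] = raw_value.decode("latin-1")
    return environ


def recover_bytes(value):
    """Undo the latin-1 decoding, giving back the original bytes."""
    return value.encode("latin-1")


raw = [
    (b"PATH_INFO", "/caf\u00e9".encode("utf-8")),
    (b"HTTP_COOKIE", b"name=value"),
    (b"trac.env_path", b"/some/path"),
    (b"UNPARSED_URI", "/caf\u00e9".encode("utf-8")),
]
environ = build_environ(raw)
```

Because latin-1 maps every byte to exactly one code point, an application that needs special handling for a value such as UNPARSED_URI can call recover_bytes() on it and decode the result with whatever encoding it actually expects.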
If people want consistency, they should just implement it as WSGI middleware
where they can rather than using mod_rewrite fiddles.

Now, if we are going to start using bytes for request headers, there is the
other question of response data. The original proposal in the amendments was
that the application should provide bytes, but that the WSGI adapter must
accept either bytes or strings, with strings interpreted as latin-1. Is there
sense in being more strict in this case?

In Python 2.X some WSGI adapters only allow Python 2.X strings (i.e., bytes)
and reject unicode strings. Others will convert unicode strings but, rather
than use latin-1, apply the default Python encoding. Thus, there is no
consistency.

As to wsgi.file_wrapper, the only logical thing seems to be to require the
file object to return bytes, i.e., raw mode, and not be in text mode.

Ultimately I am just implementing the WSGI adapter; I'll follow whatever is
decided. I am not in a position, since I don't develop stuff that runs on it,
to know what is best. So, as long as it is clear what should be passed
through as bytes for the environment, i.e., there is an all-inclusive list
and I don't somehow have to guess, then I am fine either way.

I'd just like to see some decision, and for that decision not to be some time
next year, as I am holding up mod_wsgi 3.0 until things have been clarified.
:-(

Graham

> Also all headers that carry non integer values (like HTTP_CONTENT_TYPE and
> CONTENT_TYPE). Now it's true that the headers should not contain non latin1
> values but reality shows that they do. Cookies are transmitted as headers as
> well and no browser complains if you put utf-8 encoded stuff into it. It may
> be the case that for the browser this looks like latin1, but in the end the
> application decodes it from utf-8 and is happy.
>
> Data sent from the application can continue to work like they do currently.
> However Django, Werkzeug, Paste and many others that support unicode output
> will just check if the output is unicode, and if that's the case, encode to
> the desired encoding.
>
> Also people abuse middlewares a lot and they deal with incoming and outgoing
> data as well. One can expect these middlewares to work on known encodings as
> well so those would do the encode / decode dance too.
>
> If one knows the encoding of the environ, then the webserver. Apparently
> there are issues getting the encoding of the environ but those won't go away
> when moving that to the web application.
>
> Because of that I propose that Python 3 would ship a version of wsgiref with
> Python 3.1 that uses bytestrings for the headers in question and add a section
> on Python 3 compatibility based on that to PEP 333.
>
> I volunteer for writing a new section on Python 3 in PEP 333 :-)
>
>
> Regards,
> Armin
>
> _______________________________________________
> Web-SIG mailing list
> Web-SIG@python.org
> Web SIG: http://www.python.org/sigs/web-sig
> Unsubscribe:
> http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com