FWIW, there was a past discussion of these issues on the mod_wsgi list. I can't really remember what the outcome was. The discussion is at:
http://groups.google.com/group/modwsgi/browse_frm/thread/2471a1a71620629f

Graham

2008/11/13 Andrew Clover <[EMAIL PROTECTED]>:
> It would be lovely if we could allow WSGI applications to reliably accept
> Unicode paths.
>
> That is to say, allow WSGI apps to have beautiful URLs like Wikipedia's,
> without requiring URL-rewriting magic. (Which is so highly server-specific,
> potentially unavailable to non-admin webmasters, and makes WSGI app
> deployment more difficult than it already is.)
>
>
> If we could reliably read the bytes the browser sends to us in the GET
> request that would be great, we could just decode those and be done with it.
> Unfortunately, that's not reliable, because:
>
> 1. thanks to an old wart in the CGI specification, %XX hex escapes are
> decoded before the character is put into the PATH_INFO environment variable;
>
> 2. the environment variables may be stored as Unicode.
>
> (1) on its own gives us the problem of not being able to distinguish a
> path-separator slash from an encoded %2F; a long-known problem but not one
> that greatly affects most people.
>
> But combined with (2) that means some other component must choose how to
> decode the bytes into Unicode characters. No standard currently specifies
> what encoding to use, it is not typically configurable, and it's certainly
> not within reach of the WSGI application. My assumption is that most
> applications will want to end up with UTF-8-encoded URLs; other choices are
> certainly possible but as we move towards IRI they become less likely.
>
>
> This situation previously affected only Windows users, because NT
> environment variables are native Unicode. However, Python 3.0 specifies all
> environment variable access is through a Unicode wrapper, and gives no way
> to control how that automatic decoding is done, leaving everyone in the same
> boat.
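To make points (1) and (2) above concrete, here is a minimal Python 3 sketch. It is not from the original post: the example paths are made up, and it only uses urllib.parse from the standard library to imitate what a server does before the application sees PATH_INFO.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# (1) Once %XX escapes are decoded, an encoded slash is indistinguishable
# from a real path separator. Both paths below are made-up examples.
escaped = "/files/" + quote("a/b", safe="")   # the slash is sent as %2F
plain = "/files/a/b"
same_after_unescape = unquote(escaped) == unquote(plain)

# (2) The same request bytes become different text depending on which
# charset some other component picks for the bytes-to-Unicode step.
raw = unquote_to_bytes("/caf%C3%A9")
as_utf8 = raw.decode("utf-8")          # what a UTF-8 application wants
as_latin1 = raw.decode("iso-8859-1")   # what it gets if 8859-1 is applied
```

The application only ever sees the result of both steps, so it cannot tell which of the two original requests it received, nor which charset was applied.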
>
> WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ should
> be "decoded from the headers using HTTP standard encodings (i.e. latin-1 +
> RFC 2047)", but unfortunately this doesn't quite work:
>
> 1. for many existing environments the decoding-from-headers charset is out
> of reach of the WSGI server/layer and may well not be ISO-8859-1. Even
> wsgiref doesn't currently use 8859-1 (see below).
>
> 2. RFC2047 is not applicable to HTTP headers, which are not really
> 822-family headers even though they look just like them. The sub-headers in
> eg. a multipart/form-data chunk *are* (probably) proper 822 headers so
> RFC2047 could apply, but those headers are already dealt with by the
> application or framework, not WSGI. HTTP 1.1 (RFC2616) does refer to RFC2047
> as an encoding mechanism for TEXT and quoted-string, but this makes no sense
> as 2047 itself requires embedding in atom-based parsing sequences which
> those productions are not (quoted-strings are explicitly disallowed by 2047
> itself). In any case no existing browser attempts to support RFC2047
> encoding rules for any possible interpretation of what 2616 might mean.
>
>
> Something like Luís Bruno's ORIGINAL_PATH_INFO proposal
> (http://mail.python.org/pipermail/web-sig/2008-January/003124.html) would be
> worth looking at for this IMO. It may be of questionable usefulness if the
> only character affected is the slash, but it also happens to solve the
> Unicode problem. Obviously whatever it was called it would have to be an
> optional additional value in the WSGI environ, as pure CGI servers wouldn't
> be able to supply it. Conceivably it might also be possible to have a
> standardised mod_rewrite rule to make the variable also available to Apache
> CGI scripts, but still this is far from global availability.
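As a rough illustration of how an application might consume such an optional value, here is a small Python 3 sketch. The exact semantics are an assumption, not something the proposal is known to specify: it supposes the key holds the request path with its %XX escapes still intact. The helper name path_segments is hypothetical.

```python
from urllib.parse import unquote


def path_segments(environ, charset="utf-8"):
    """Return the path segments, preferring an undecoded path if present."""
    # Assumption: ORIGINAL_PATH_INFO, when supplied, is still %XX-escaped.
    raw = environ.get("ORIGINAL_PATH_INFO")
    if raw is None:
        # Fallback for servers that cannot supply it: PATH_INFO is already
        # unescaped, so an encoded %2F can no longer be told from a separator.
        return environ.get("PATH_INFO", "").split("/")[1:]
    # Split on real separators first, then unescape each segment with the
    # charset the application wants.
    return [unquote(segment, encoding=charset) for segment in raw.split("/")[1:]]
```

Splitting before unescaping is what keeps an encoded slash inside its segment, and the application rather than the server picks the charset.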
>
> In the meantime I've been looking at how various combinations of servers
> deal with this issue, and in what circumstances an application or middleware
> can safely recover all possible Unicode input. 'Apache' refers to the
> (AFAICT-identical) behaviour of both mod_cgi and mod_wsgi; 'IIS' refers to
> IIS with CGI.
>
>
> *** Apache/Posix/Python2
> OK.
>
> No problem here, it's byte-based all the way through.
>
>
> *** Apache/Posix/Python3:
> Dependent on the default encoding.
>
> Apache puts bytes into the envvars but Python takes them out as unicode. If
> the system default encoding happens to be the same as the encoding the WSGI
> application wanted we will be OK. Normally the app will want UTF-8; many
> Linux distributions do use UTF-8 as the default system encoding but there
> are plenty of distros (eg. Debian) and other Unixen that do not. In any case
> we are getting a nasty system dependency at deploy time that many webmasters
> will not be able to resolve.
>
> It is sometimes possible to recover mangled characters despite the wrong
> decoding having been applied. For example if the system encoding was
> ISO-8859-1 or another encoding that maps every byte to a unique Unicode
> character, we can encode the Unicode string back to its original bytes, and
> thence apply the decoding we actually wanted! If, on the other hand, it's
> something like ISO-8859-4, where not all high bytes are mapped at all, we'll
> be losing random characters... not good.
>
>
> *** Apache/NT/Python2
> Always unrecoverable data loss.
>
> Apache on Windows always uses ISO-8859-1 to decode the request path and put
> it in the Unicode envvars. This is OK so far, we have Unicode characters
> with the same codepoints as the original bytes. However, Python2 needs to
> make the envvars available as bytes. It uses the system default encoding; if
> that were ISO-8859-1, we'd be OK.
>
> But it never is.
> Western European on NT is actually cp1252, whose characters
> in the range 0x80 to 0x9F differ from ISO-8859-1. And if the app wants
> UTF-8, chances are those characters are going to come up a lot. There is as
> far as I know no user-selectable Windows codepage that can map all the
> Unicode characters up to U+00FF.
>
>
> *** Apache/NT/Python3
> Wrong, but always recoverable.
>
> Python retrieves the bytes-encoded-into-Unicode-codepoints string directly
> from the envvars. If the encoding should have been UTF-8 or something else
> other than ISO-8859-1, we can recover the original bytes by re-encoding to
> 8859-1, then decoding using the real charset.
>
>
> *** IIS/NT/Python2
> Mostly unrecoverable data loss.
>
> IIS decodes submitted bytes to Unicode using UTF-8 when it can. But if there
> is an invalid UTF-8 sequence in the bytes it will try again using the system
> codepage. Python will then re-encode the Unicode envvar using the system
> codepage.
>
> If the app is expecting UTF-8 we can decode what Python gives us using the
> system codepage (ie. 'mbcs') and get back any of the submitted characters
> that happened to be in this server's system codepage. Other characters may
> be replaced by question marks or Windows's best attempts to give us
> something useful, which at best may be a character shorn of diacriticals and
> at worst something just completely wrong.
>
> NT's system codepage is never UTF-8, it is not a user-selectable option
> never mind the default. We can improve our chances of getting more
> characters through by using a character set with a wide repertoire, such as
> cp932 (Shift-JIS). But it's still not really proper Unicode support.
>
> If the app is expecting something non-UTF-8 there's not much hope. Even if
> it wanted the same character set as the system codepage, it can't be sure
> that the submitted bytes didn't happen to also be a valid UTF-8 sequence,
> and thus get mangled by IIS decoding them that way.
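The difference between the recoverable and unrecoverable Apache/NT cases above can be sketched in Python 3 as follows. This is an illustration only: the function names and the sample path are hypothetical, and errors="replace" merely approximates what a real Windows codepage conversion does with unmappable characters (it may pick a best-fit substitute instead of a question mark).

```python
def recover(mangled: str, applied: str = "iso-8859-1", wanted: str = "utf-8") -> str:
    """Undo a wrong decoding by re-encoding, then decode with the wanted charset."""
    return mangled.encode(applied).decode(wanted)


def through_codepage(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Approximate Apache/NT/Python2: bytes to Unicode via 8859-1, then back
    to bytes via the system codepage."""
    return raw.decode("iso-8859-1").encode(codepage, errors="replace")


# Hypothetical path; U+0100 is the byte pair C4 80 in UTF-8.
path = "/wiki/\u0100rzemes"
raw = path.encode("utf-8")

# Apache/NT/Python3: the 8859-1 decoding is wrong but loses nothing.
recovered = recover(raw.decode("iso-8859-1"))

# Apache/NT/Python2: U+0080 has no cp1252 mapping, so the byte 0x80 is lost.
lossy = through_codepage(raw)
```

Here recovered equals the original path, while lossy no longer contains the 0x80 continuation byte, so the UTF-8 sequence cannot be rebuilt.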
>
>
> *** IIS/NT/Python3
> OK, as long as the app wants UTF-8.
>
> Incoming UTF-8 bytes are reliably converted to Unicode strings by IIS, and
> directly read by Python from the envvars.
>
> If the application didn't want UTF-8 the situation is about as hopeless as
> with Python2.
>
>
> *** wsgiref.simple_server/(any)/Python2
> OK.
>
> Bytes all the way through.
>
>
> *** wsgiref.simple_server/(any)/Python3:
> Probably will be OK, as long as the app wants UTF-8.
>
> simple_server is currently broken in rc2. However judging by the code, it is
> using urllib.parse.unquote, which assumes UTF-8, so it'll be fine for apps
> that want UTF-8 and hopeless for those that don't.
>
>
> I'd be very interested to hear what other servers are doing in this
> situation - nginx? cherrypy's one? - and wonder if any particular behaviour
> should be 'blessed'.
>
> --
> And Clover
> mailto:[EMAIL PROTECTED]
> http://www.doxdesk.com/
> _______________________________________________
> Web-SIG mailing list
> Web-SIG@python.org
> Web SIG: http://www.python.org/sigs/web-sig
> Unsubscribe:
> http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com