Andrew Clover wrote:
If we could reliably read the bytes the browser sends to us in the GET
request, that would be great: we could just decode those and be done with
it. Unfortunately, that's not reliable, because:
1. thanks to an old wart in the CGI specification, %XX hex escapes are
decoded before the character is put into the PATH_INFO environment variable;
I don't see a problem with this? At least not a problem with respect to
encoding. As it is (in Python 2), you should do something like
environ['PATH_INFO'].decode('utf8') and it should work. It doesn't seem
like there's any distinction between %-encoded characters and plain
characters in this situation.
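As a rough illustration of that point, here is a minimal sketch (written for
Python 3 so it runs as-is; the sample paths are made up) of what happens when
%XX escapes are decoded before PATH_INFO is filled in. `unquote_to_bytes`
only stands in for the gateway's decoding step; it is not a claim about how
any particular server implements it.

```python
from urllib.parse import unquote_to_bytes

# Two hypothetical request paths: one escapes only the non-ASCII character,
# the other escapes the ASCII letters as well.
escaped = "/caf%C3%A9"
over_escaped = "/%63%61%66%C3%A9"

# A CGI-style gateway decodes the %XX escapes before filling in PATH_INFO.
path_a = unquote_to_bytes(escaped)
path_b = unquote_to_bytes(over_escaped)

# Both requests end up as the same bytes, so one decode handles either.
assert path_a == path_b == b"/caf\xc3\xa9"
decoded = path_a.decode("utf8")
```

After the escapes are gone, the application has no way (and, for encoding
purposes, no need) to know which characters were originally escaped.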
2. the environment variables may be stored as Unicode.
(1) on its own gives us the problem of not being able to distinguish a
path-separator slash from an encoded %2F; a long-known problem but not
one that greatly affects most people.
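The slash ambiguity is easy to reproduce. A minimal sketch in Python 3, with
made-up paths; `unquote` stands in for the decoding the gateway performs:

```python
from urllib.parse import unquote

# Hypothetical request paths: in the first, "a/b" is meant as ONE segment
# whose slash was escaped as %2F; in the second, "a" and "b" are two segments.
one_segment = "/files/a%2Fb"
two_segments = "/files/a/b"

# After %XX decoding both look identical, so the application cannot tell
# which slashes were path separators.
assert unquote(one_segment) == unquote(two_segments) == "/files/a/b"

# Splitting before decoding keeps the distinction, but that needs the raw
# request path, which PATH_INFO does not carry.
raw_segments = [unquote(s) for s in one_segment.split("/")[1:]]
```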
But combined with (2) that means some other component must choose how to
decode the bytes into Unicode characters. No standard currently
specifies what encoding to use, it is not typically configurable, and
it's certainly not within reach of the WSGI application. My assumption
is that most applications will want to end up with UTF-8-encoded URLs;
other choices are certainly possible but as we move towards IRI they
become less likely.
This situation previously affected only Windows users, because NT
environment variables are native Unicode. However, Python 3.0 specifies that
all environment variable access is through a Unicode wrapper, and gives
no way to control how that automatic decoding is done, leaving everyone
in the same boat.
WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ
should be "decoded from the headers using HTTP standard encodings (i.e.
latin-1 + RFC 2047)", but unfortunately this doesn't quite work:
My understanding of this suggestion is that latin-1 is a way of
representing bytes as unicode. In other words, the values will be
unicode, but that will simply be a lie. So if you know you have UTF8
paths, you'd do:
path_info = environ['PATH_INFO'].encode('latin-1').decode('utf8')
As far as I can tell this is simply to avoid having bytes in the
environment, even though bytes are an accurate representation and
unicode is not.
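A minimal sketch of that round trip in Python 3, assuming the server filled
environ by decoding the raw bytes as latin-1 (the `environ` dict and the
sample path are made up for illustration):

```python
# Raw bytes as they might arrive after %XX decoding: "/caf" plus U+00E9
# (e with acute accent) encoded as UTF-8.
raw_path = b"/caf\xc3\xa9"

# Assumption: the server exposes them as text by decoding with latin-1, which
# maps each byte 0-255 to the code point with the same number.
environ = {"PATH_INFO": raw_path.decode("latin-1")}

# The value is a str, but it holds two wrong characters (U+00C3, U+00A9)
# where the intended text has one (U+00E9).
assert environ["PATH_INFO"] == "/caf\u00c3\u00a9"

# Encoding back to latin-1 recovers the exact original bytes...
recovered = environ["PATH_INFO"].encode("latin-1")
assert recovered == raw_path

# ...which the application can then decode with the encoding it expects.
path_info = recovered.decode("utf8")
```

The latin-1 step loses nothing because every byte value has a latin-1 code
point, which is why the str can carry arbitrary bytes without being a
meaningful decoding of them.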
A lot of what you write about has to do with CGI, which is the only
place WSGI interacts with os.environ. CGI is really an aspect of the
CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not of WSGI itself.
Personally I'm more inclined to set up a policy on the WSGI server
itself with respect to the encoding, and then use real unicode
characters. Unfortunately that's not as flexible as bytes, as it
doesn't make it very easy to sniff out the encoding in
application-specific ways, or support different encodings in different
parts of the server (which would be useful if, for instance, you were to
proxy applications with unknown encodings). So... maybe that's not the
most feasible option. But if it's not, then I'd rather stick with bytes.
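To make the flexibility argument concrete, here is a hypothetical helper
(the name `decode_path_info` and its `encodings` parameter are invented for
this sketch, not part of WSGI or any library) that sniffs the encoding in an
application-specific way. It assumes the latin-1 convention described above,
so the original bytes are still recoverable:

```python
def decode_path_info(environ, encodings=("utf8", "latin-1")):
    """Return PATH_INFO as text, trying each candidate encoding in order.

    Assumes environ["PATH_INFO"] is a str produced by decoding the raw
    request bytes as latin-1.
    """
    raw = environ.get("PATH_INFO", "").encode("latin-1")
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next candidate
    raise ValueError("PATH_INFO is not valid in any configured encoding")


# The same character sent by a UTF-8 client and by a latin-1 client.
utf8_request = {"PATH_INFO": b"/caf\xc3\xa9".decode("latin-1")}
latin1_request = {"PATH_INFO": b"/caf\xe9".decode("latin-1")}

assert decode_path_info(utf8_request) == decode_path_info(latin1_request)
```

If the server had already decoded with one fixed policy, the second request
could not be handled this way, because the application would never see the
bytes it needs to make the choice.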
Ian Bicking : [EMAIL PROTECTED] : http://blog.ianbicking.org
Web-SIG mailing list
Web SIG: http://www.python.org/sigs/web-sig