It would be lovely if we could allow WSGI applications to reliably
accept Unicode paths.
That is to say, allow WSGI apps to have beautiful URLs like Wikipedia's,
without requiring URL-rewriting magic. (Which is so highly
server-specific, potentially unavailable to non-admin webmasters, and
makes WSGI app deployment more difficult than it already is.)
If we could reliably read the bytes the browser sends to us in the GET
request, that would be great: we could just decode those and be done
with it. Unfortunately, that's not reliable, because:
1. thanks to an old wart in the CGI specification, %XX hex escapes are
decoded before the character is put into the PATH_INFO environment variable;
2. the environment variables may be stored as Unicode.
(1) on its own gives us the problem of not being able to distinguish a
path-separator slash from an encoded %2F; a long-known problem but not
one that greatly affects most people.
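To illustrate (using urllib.parse.unquote to stand in for the server's
CGI decoding step, and a made-up path), two distinct request paths
collapse into the same PATH_INFO string once the %XX escapes have been
decoded:

```python
from urllib.parse import unquote

# The server %-decodes the raw request path before putting it in
# PATH_INFO, so an escaped slash becomes indistinguishable from a
# real path separator:
print(unquote('/bands/AC%2FDC/albums'))  # -> /bands/AC/DC/albums
print(unquote('/bands/AC/DC/albums'))    # -> /bands/AC/DC/albums
```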
But combined with (2) that means some other component must choose how to
decode the bytes into Unicode characters. No standard currently
specifies what encoding to use; it is not typically configurable, and
it's certainly not within reach of the WSGI application. My assumption
is that most applications will want to end up with UTF-8-encoded URLs;
other choices are certainly possible, but as we move towards IRIs they
become less likely.
This situation previously affected only Windows users, because NT
environment variables are native Unicode. However, Python 3.0 specifies
all environment variable access is through a Unicode wrapper, and gives
no way to control how that automatic decoding is done, leaving everyone
in the same boat.
WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ
should be "decoded from the headers using HTTP standard encodings (i.e.
latin-1 + RFC 2047)", but unfortunately this doesn't quite work:
1. for many existing environments the decoding-from-headers charset is
out of reach of the WSGI server/layer and may well not be ISO-8859-1.
Even wsgiref doesn't currently use 8859-1 (see below).
2. RFC2047 is not applicable to HTTP headers, which are not really
822-family headers even though they look just like them. The sub-headers
in eg. a multipart/form-data chunk *are* (probably) proper 822 headers
so RFC2047 could apply, but those headers are already dealt with by the
application or framework, not WSGI. HTTP 1.1 (RFC2616) does refer to
RFC2047 as an encoding mechanism for TEXT and quoted-string, but this
makes no sense, as 2047 itself requires encoded-words to be embedded in
atom-based parsing contexts, which those productions are not
(quoted-strings are explicitly disallowed by 2047 itself). In any case,
no existing browser attempts to
support RFC2047 encoding rules for any possible interpretation of what
2616 might mean.
Something like Luís Bruno's ORIGINAL_PATH_INFO proposal
would be worth looking at for this IMO. It may be of questionable
usefulness if the only character affected is the slash, but it also
happens to solve the Unicode problem. Obviously whatever it was called
it would have to be an optional additional value in the WSGI environ, as
pure CGI servers wouldn't be able to supply it. Conceivably it might
also be possible to have a standardised mod_rewrite rule to make the
variable also available to Apache CGI scripts, but still this is far
from global availability.
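A rough sketch of how middleware might take advantage of such a key if
the server supplied it (the environ key name follows the proposal, but
is an assumption, as is the UTF-8 default; today's urllib.parse.unquote
does the %XX decoding):

```python
from urllib.parse import unquote

def raw_path_middleware(app, encoding='utf-8'):
    """Prefer a raw, undecoded request path when the server supplies
    one. 'ORIGINAL_PATH_INFO' is the proposed optional environ key;
    a real server might spell it differently or omit it entirely."""
    def wrapped(environ, start_response):
        raw = environ.get('ORIGINAL_PATH_INFO')
        if raw is not None:
            # Decode the %XX escapes ourselves, in the charset the
            # application actually wants. (An app that cares about
            # %2F would split on '/' before unquoting each segment.)
            environ['PATH_INFO'] = unquote(raw, encoding=encoding)
        return app(environ, start_response)
    return wrapped
```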
In the meantime I've been looking at how various combinations of servers
deal with this issue, and in what circumstances an application or
middleware can safely recover all possible Unicode input. 'Apache'
refers to the (AFAICT-identical) behaviour of both mod_cgi and mod_wsgi;
'IIS' refers to IIS with CGI.
Apache on Unix, Python 2: no problem here; it's byte-based all the way
through.
Apache on Unix, Python 3: dependent on the system default encoding.
Apache puts bytes into the envvars, but Python takes them out as
Unicode.
If the system default encoding happens to be the same as the encoding
the WSGI application wanted we will be OK. Normally the app will want
UTF-8; many Linux distributions do use UTF-8 as the default system
encoding but there are plenty of distros (eg. Debian) and other Unixen
that do not. In any case we are getting a nasty system dependency at
deploy time that many webmasters will not be able to resolve.
It is sometimes possible to recover mangled characters despite the wrong
decoding having been applied. For example if the system encoding was
ISO-8859-1 or another encoding that maps every byte to a unique Unicode
character, we can encode the Unicode string back to its original bytes,
and thence apply the decoding we actually wanted! If, on the other hand,
it's something like ISO-8859-4, where not all high bytes are mapped at
all, we'll be losing random characters... not good.
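A minimal sketch of that recovery trick, assuming the wrong decoding
was ISO-8859-1 and the application really wanted UTF-8:

```python
# The browser sent UTF-8 bytes, but some layer decoded them as
# ISO-8859-1, which maps every byte to a unique codepoint:
raw = 'caf\u00e9'.encode('utf-8')   # b'caf\xc3\xa9' on the wire
mangled = raw.decode('iso-8859-1')  # 'caf\xc3\xa9' as characters

# Because that decoding was lossless byte-for-byte, we can round-trip
# back to the original bytes and apply the charset we wanted:
recovered = mangled.encode('iso-8859-1').decode('utf-8')
print(recovered)  # -> café

# Had the wrong charset left some bytes unmapped, the original bytes
# would be irretrievable.
```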
Apache on Windows, Python 2: always unrecoverable data loss.
Apache on Windows always uses ISO-8859-1 to decode the request path and
put it in the Unicode envvars. This is OK so far, we have Unicode
characters with the same codepoints as the original bytes. However,
Python2 needs to make the envvars available as bytes. It uses the system
default encoding; if that were ISO-8859-1, we'd be OK.
But it never is. Western European on NT is actually cp1252, whose
characters in the range 0x80 to 0x9F differ from ISO-8859-1. And if the
app wants UTF-8, chances are those characters are going to come up a
lot. There is as far as I know no user-selectable Windows codepage that
can map all the Unicode characters up to U+00FF.
Apache on Windows, Python 3: wrong, but always recoverable.
Python retrieves the bytes-encoded-into-Unicode-codepoints string
directly from the envvars. If the encoding should have been UTF-8 or
something else other than ISO-8859-1, we can recover the original bytes
by re-encoding to 8859-1, then decoding using the real charset.
IIS, Python 2: mostly unrecoverable data loss.
IIS decodes submitted bytes to Unicode using UTF-8 when it can. But if
there is an invalid UTF-8 sequence in the bytes it will try again using
the system codepage. Python will then re-encode the Unicode envvar using
the system codepage.
If the app is expecting UTF-8 we can decode what Python gives us using
the system codepage (ie. 'mbcs') and get back any of the submitted
characters that happened to be in this server's system codepage. Other
characters may be replaced by question marks or Windows's best attempts
to give us something useful, which at best may be a character shorn of
diacriticals and at worst something just completely wrong.
NT's system codepage is never UTF-8; it is not even a user-selectable
option, never mind the default. We can improve our chances of getting more
characters through by using a character set with a wide repertoire, such
as cp932 (Shift-JIS). But it's still not really proper Unicode support.
If the app is expecting something non-UTF-8 there's not much hope. Even
if it wanted the same character set as the system codepage, it can't be
sure that the submitted bytes didn't happen to also be a valid UTF-8
sequence, and thus get mangled by IIS decoding them that way.
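The IIS-plus-Python-2 behaviour described above can be simulated like
this (cp1252 stands in for the server's system codepage, since the
'mbcs' codec only exists on Windows; the function name is mine):

```python
SYSCP = 'cp1252'  # stand-in for the Windows system codepage ('mbcs')

def envvar_bytes_via_iis_cgi(raw):
    """Simulate what a Python 2 CGI app would see under IIS."""
    try:
        u = raw.decode('utf-8')        # IIS tries UTF-8 first...
    except UnicodeDecodeError:
        u = raw.decode(SYSCP)          # ...falling back to the codepage
    return u.encode(SYSCP, 'replace')  # Python 2 re-encodes the envvar

# A character inside the codepage survives the round trip:
seen = envvar_bytes_via_iis_cgi('caf\u00e9'.encode('utf-8'))
print(seen.decode(SYSCP))  # -> café

# One outside it (here U+0101) is gone for good:
seen = envvar_bytes_via_iis_cgi('\u0101'.encode('utf-8'))
print(seen.decode(SYSCP))  # -> caf? ... just '?'
```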
IIS, Python 3: OK, as long as the app wants UTF-8.
Incoming UTF-8 bytes are reliably converted to Unicode strings by IIS,
and directly read by Python from the envvars.
If the application didn't want UTF-8 the situation is about as hopeless
as with Python2.
wsgiref, Python 2: bytes all the way through.
wsgiref, Python 3: probably will be OK, as long as the app wants UTF-8.
simple_server is currently broken in the 3.0 rc2 release. However,
judging by the code,
it is using urllib.parse.unquote, which assumes UTF-8, so it'll be fine
for apps that want UTF-8 and hopeless for those that don't.
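For what it's worth, current Python 3's urllib.parse.unquote does take
an encoding argument, so an app or middleware doing its own decoding
isn't necessarily stuck with UTF-8:

```python
from urllib.parse import unquote

print(unquote('caf%C3%A9'))                      # -> café (UTF-8 default)
print(unquote('caf%E9'))                         # -> 'caf' + U+FFFD
print(unquote('caf%E9', encoding='iso-8859-1'))  # -> café
```

Note that the default on bad input is errors='replace', so a non-UTF-8
byte sequence silently becomes replacement characters rather than
raising.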
I'd be very interested to hear what other servers are doing in this
situation - nginx? cherrypy's one? - and wonder if any particular
behaviour should be 'blessed'.
Web-SIG mailing list
Web SIG: http://www.python.org/sigs/web-sig