I can't read this whole thread carefully; there's too much of it. I will note, however, that people are STILL ignoring surrogateescape (PEP 383, http://www.python.org/dev/peps/pep-0383/). This is the third or fourth time I've brought it up. It was added in Python 3.1 to address some of the exact issues we are encountering.
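As a rough sketch of the kind of issue in question (the path bytes here are made up for illustration, and this is not taken from any WSGI implementation), compare how the standard error handlers treat a byte string that is not valid UTF-8:

```python
# Hypothetical path bytes containing a stray 0xEF, so not valid UTF-8.
raw = b'/foo\xefbar'

# Strict decoding (the default) refuses the input outright.
try:
    raw.decode('utf8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' decodes, but substitutes U+FFFD, so the original byte is lost.
lossy = raw.decode('utf8', 'replace')
lossy_roundtrip = lossy.encode('utf8')

# 'surrogateescape' carries the bad byte through as a lone surrogate,
# which the same handler turns back into the original byte on encoding.
escaped = raw.decode('utf8', 'surrogateescape')
escaped_roundtrip = escaped.encode('utf8', 'surrogateescape')

print(strict_failed)
print(lossy_roundtrip == raw)
print(escaped_roundtrip == raw)
```

Only the surrogateescape path gives back the bytes the client actually sent.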
Particularly, imagine someone requests /foo%efbar (which is not valid UTF-8).

>>> # after URL unquoting (urllib.request.unquote doesn't work for this currently)
>>> SCRIPT_NAME = b'/foo\xefbar'
>>> s = SCRIPT_NAME.decode('utf8', 'surrogateescape')
>>> s
'/foo\udcefbar'
>>> s.encode('utf8', 'surrogateescape')
b'/foo\xefbar'

So we can have unicode values that can be safely and correctly transcoded to other encodings (or handled in their raw form). The constraints on surrogateescape are:

* You have to use 'surrogateescape' during both decoding and encoding (I think for decoding it should be part of the spec).
* You have to know the encoding; doing s.encode('latin1', 'surrogateescape') wouldn't necessarily preserve the correct bytes (it does for this example, but wouldn't if there was a mix of valid UTF-8 and invalid bytes).

There's also a bit of an annoyance in the fact that SCRIPT_NAME/PATH_INFO should always be treated as UTF-8 (which might sometimes be wrong, but for any modern app/browser will be right), while other parts (HTTP_COOKIE?) may be in a "native" encoding.

Well, besides HTTP_COOKIE, I don't know what else would be in a different encoding. Atompub adds Slug, but it's a URL/IRI, so it should be ASCII. I have seen proposals for a Title header (e.g., when PUTting an image and giving it a title), and that could be unicode. But in all those cases it'll be a modern app and modern clients, and in those cases people just use UTF-8.

Frankly, I'm open to UTF-8-everywhere. People mentioned Jack and Rack, and to whatever degree that works, it probably works because everyone uses UTF-8. With surrogateescape we allow transcoding when needed (e.g., if you wanted to handle redirects from old/weird non-UTF-8 URLs) but keep things reasonably simple otherwise.

Ian
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com
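To illustrate the "you have to know the encoding" constraint from the message above, here is a minimal sketch. The path bytes are hypothetical, chosen to mix a valid UTF-8 sequence with a stray invalid byte:

```python
# Hypothetical path: 'é' as valid UTF-8 (b'\xc3\xa9') followed by a stray 0xEF.
raw = b'/caf\xc3\xa9\xefbar'

# Decode as UTF-8; the stray byte becomes the lone surrogate U+DCEF.
text = raw.decode('utf8', 'surrogateescape')

# Encoding with the same codec restores the original bytes exactly.
same_codec = text.encode('utf8', 'surrogateescape')

# Encoding with a different codec transcodes the valid part ('é' -> 0xE9),
# so the result no longer matches the bytes that were received.
other_codec = text.encode('latin1', 'surrogateescape')

print(same_codec == raw)
print(other_codec == raw)
print(other_codec)
```

The latin1 result is a legitimate transcoding of the path, but it is not the original request bytes, so code that needs the raw form has to encode with the codec it decoded with.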