Just to narrow in on one case, URLs, there are a few pieces of information
that make up the URL:

wsgi.url_scheme: this is *not* present in the request itself; it's inferred
by the server (e.g., from the port the client connected to)
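As a minimal sketch of that inference -- the function and parameter names here are illustrative assumptions, not any particular server's API:

```python
# Hypothetical sketch of how a WSGI server might pick wsgi.url_scheme.
def guess_scheme(server_port, is_tls, forwarded_proto=None):
    # A trusted proxy header (e.g. X-Forwarded-Proto) can override.
    if forwarded_proto in ('http', 'https'):
        return forwarded_proto
    # Otherwise infer from the connection itself.
    if is_tls or server_port == 443:
        return 'https'
    return 'http'
```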

HTTP_HOST: this is a header.  It typically contains both the hostname and
the port.  The hostname's encoding is generally IDNA, though you have to
split the port off first.  The Unicode form of the hostname is not widely
supported in client libraries (IDNA decoding is usually applied at the UI
level).
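The split-the-port-off-first point can be shown with Python's built-in 'idna' codec (the helper name is my own):

```python
def split_host(http_host):
    # Split the port off *before* decoding; the port is plain ASCII digits.
    host, sep, port = http_host.partition(':')
    # The stdlib 'idna' codec decodes each dotted label, turning
    # Punycode ('xn--...') labels back into Unicode.
    hostname = host.encode('ascii').decode('idna')
    return hostname, int(port) if sep else None
```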

SCRIPT_NAME/PATH_INFO: these represent portions of the request path (the
part before the '?').  As submitted these are generally ASCII (URL-quoted).
After unquoting, they are typically UTF-8, but may be of any encoding, or
of no consistent encoding at all.  If an unsafe character is present in the
URL-quoted version of the path, it may be quoted at the byte level.  The
'?' character is effectively a byte-oriented marker, and no encoding can
affect it.
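A small illustration of both points -- unquoting yields bytes that are usually (but not necessarily) UTF-8, and the '?' split happens on the raw bytes before any unquoting:

```python
from urllib.parse import unquote_to_bytes

# The request line is split on the first '?' *before* any unquoting, so a
# quoted %3F inside the path can never be mistaken for the query marker.
raw_path = '/caf%C3%A9/faq%3Fv2'
path_bytes = unquote_to_bytes(raw_path)   # raw bytes; %3F becomes b'?'
path = path_bytes.decode('utf-8')         # typically UTF-8, but not guaranteed
```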

QUERY_STRING: this is also generally ASCII (URL-quoted).  Unsafe characters
could be quoted at the byte level.
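Byte-level quoting of that sort round-trips arbitrary unsafe bytes through an ASCII-only QUERY_STRING without assuming any character encoding, e.g.:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Quote unsafe bytes at the byte level; no character encoding is assumed.
raw_value = b'\xff\xfe some bytes'
quoted = quote_from_bytes(raw_value, safe='')
assert quoted.isascii()                      # safe to carry in QUERY_STRING
assert unquote_to_bytes(quoted) == raw_value # lossless round trip
```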

Generally I'm unaware of any reasonable situation where quoting unsafe
characters in an HTTP request would be improper, or would even lose any
meaningful information -- mostly because I don't know of any clients that
actually expect unsafe characters to work.  Quoting HTTP_HOST is difficult:
it's not a byte-oriented quoting but a fairly complex encoding.  But I'm
also not sure where in a stack you could actually handle unsafe characters
in HTTP_HOST -- it seems like simply an invalid request, and deferring the
error won't give another part of the stack the opportunity to do the right
thing.

In their quoted form all of these values (that is, including the
still-quoted path, though not the unquoted SCRIPT_NAME/PATH_INFO) *should*
be ASCII, and I believe a WSGI server could ensure they were all ASCII
without any loss of useful information (either by simply rejecting the
request or by applying quoting).  I don't see any place where bytes are
advantageous.  Representing invalid requests does not seem particularly
helpful -- *some* invalid requests are useful to handle (e.g., weird
cookies), but in the case of the URL variables I don't see any benefit.
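The "applying quoting" option could be as simple as this hypothetical normalization pass, which assumes bytes outside printable ASCII are the only offenders and leaves existing %XX escapes untouched:

```python
def ensure_ascii(raw):
    # Percent-quote any byte that isn't printable ASCII; everything else
    # (including existing %XX escapes) passes through unchanged, so the
    # result is pure ASCII and no information is lost.
    return ''.join(
        chr(b) if 0x21 <= b <= 0x7e else '%%%02X' % b
        for b in raw
    )
```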

IMHO all the tricky encoding issues are in the request and response bodies,
and I'm pretty sure we have consensus that those should be bytes.

Reiterating other encoding issues I'm aware of:

Cookie encodings: parsing cookies as bytes or as Latin-1 is basically
equivalent, and I don't believe they should ever be parsed as, for
instance, UTF-8.  Parsing as bytes might avoid an unnecessary encode/decode
round trip, but cookie parsing is all tricky enough that libraries should
do it anyway, and the encoding overhead alone isn't very important.
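The bytes/Latin-1 equivalence holds because Latin-1 maps each byte 0x00-0xFF to the code point of the same number, one-to-one:

```python
# Decoding a Cookie header as Latin-1 can never fail or lose bytes,
# because every byte value maps to exactly one code point and back.
raw = b'session=\x80\xffabc; lang=fr'
as_text = raw.decode('latin-1')
assert as_text.encode('latin-1') == raw   # lossless round trip
```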

Another example is the Atom Title header (
http://bitworking.org/projects/atom/draft-ietf-atompub-protocol-08.html#rfc.section.8.1.2),
but that's supposed to be Latin-1 with RFC 2047 encoded-words, and I don't
believe anyone is proposing that RFC 2047 encodings be handled generally at
the WSGI layer (I think CherryPy does or used to handle these, but there
were many objections, at least on this list, in part due to security
concerns).  An RFC 2047 encoding looks like "Title:
=?utf-8?q?stuff-with=-escaping?=".
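For the curious, Python's stdlib can decode RFC 2047 encoded-words; here is a sketch using a concrete well-formed encoded-word of my own:

```python
from email.header import decode_header

# decode_header() splits a header into (chunk, charset) pairs; chunks from
# encoded-words come back as bytes tagged with their declared charset.
parts = decode_header('=?utf-8?q?caf=C3=A9?=')
text = ''.join(
    chunk.decode(enc or 'ascii') if isinstance(chunk, bytes) else chunk
    for chunk, enc in parts
)
```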

Response headers are equivalent to request headers.  The response status
line is constrained by the spec to Latin-1, and there are no use cases I
know of (even really obscure ones) where another encoding would be
necessary.

And that's it!  HTTP has a fairly finite amount of surface area.

-- 
Ian Bicking  |  http://blog.ianbicking.org