Re: [Web-SIG] WSGI for Python 3

2010-08-30 Thread P.J. Eby


At 02:37 PM 8/30/2010 +1000, Graham Dumpleton wrote:

Anyway, rather than keep arguing the point, and so we can move forward, let us
perhaps start now with the following definitions and new names to
identify them. We can even go a bit stupid and give each its own code
name so they are in part more memorable. Any next option based on your
suggestions about changing the WHEAT option can be called MAIZE. And
if you are thinking I am going stark raving mad and should be put in a
white jacket and locked up, you could well be right. I am not a happy
camper right now, but that is because of many things besides this WSGI
stuff. :-)

 And yes I know about the page that has been just recently put up at:

  http://www.wsgi.org/wsgi/Python_3

From memory, when I first read it I wasn't sure that it was
completely accurate, but at least it doesn't now mention mod_python
instead of mod_wsgi which was mighty confusing. We can perhaps merge
the following into that page, ie., expand the table, and talk more
about the abstract definitions rather than linking it to specific
implementations at this point. We can perhaps then start capturing the
pros and cons against each option in the page rather than losing them
in the email chain.


I've added a column to the page called "flat" that captures my 
current proposal (native keys, surrogateescape values, byte stream 
in, strict bytes-only for all outputs).  This seems to me an optimum 
balance between:


* Verifiability (especially *composable* verifiability)
* Low cognitive overhead (i.e., fewest things to remember)
* Low amount of finger-typing and fewer conversions

But I certainly could be convinced otherwise by example or argument.

(One other thing I consider a plus for this approach, btw: os.environ 
is still largely usable as a WSGI environ in the CGI case.  This 
isn't so much a valuable thing in itself, as that it's an indicator 
of low complexity and cognitive overhead.) 
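
To make the "surrogateescape values" part of the flat proposal concrete, here is a minimal sketch of the round trip it relies on. The helper names are hypothetical and the choice of UTF-8 as the codec is an assumption for illustration; the message above does not spell out either. The point is only that undecodable bytes survive as lone surrogates in a native string and can be recovered exactly.

```python
# Hypothetical helpers; the UTF-8 codec is an assumption, not part of the proposal.
def to_native(raw: bytes) -> str:
    """Decode header bytes to str without raising on undecodable input."""
    return raw.decode("utf-8", "surrogateescape")


def to_bytes(value: str) -> bytes:
    """Recover the original bytes from a surrogateescape-decoded str."""
    return value.encode("utf-8", "surrogateescape")


raw = b"/caf\xe9"        # Latin-1 bytes, not valid UTF-8
value = to_native(raw)   # the bad byte becomes the lone surrogate U+DCE9
assert to_bytes(value) == raw
```

Python 3 decodes os.environ on POSIX systems with the same error handler, which is one reason it stays close to a usable environ in the CGI case.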


___
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
http://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com

Re: [Web-SIG] WSGI for Python 3

2010-08-30 Thread Ian Bicking

Just to narrow in on one case, URLs, there are a few pieces of information
that make up the URL:

wsgi.url_scheme: this is *not* present in the request, it's inferred somehow
(e.g., by the port the client connected to)

HTTP_HOST: this is a header. It typically contains both the hostname and
the port. The encoding is generally idna, though you have to split the port
off first. The unicode version of the hostname is not widely supported in
client libraries (it's usually applied at the UI level).

SCRIPT_NAME/PATH_INFO: these represent a portion of the request path (before
?). As submitted these are generally ASCII (URL-quoted). After unquoting,
they are typically UTF-8, but may be of any or no encoding. If an unsafe
character is present in the URL-quoted version of the path, it may be quoted
at the byte level. The '?' character is effectively a byte-oriented marker
and encodings cannot affect it.

QUERY_STRING: this is also generally ASCII (URL-quoted). Unsafe characters
could be quoted at the byte level.
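
As a rough illustration of the path and query handling described above, the sketch below splits a request target at the first '?' and unquotes only the path, at the byte level. The function name and the sample target are made up for the example; urllib.parse.unquote_to_bytes is the standard library call.

```python
from urllib.parse import unquote_to_bytes


def split_target(target: str):
    """Split a request target at the first '?' and unquote the path to bytes."""
    quoted_path, _, query_string = target.partition("?")
    return unquote_to_bytes(quoted_path), query_string


path, query = split_target("/caf%C3%A9/%FF?q=%3F")
try:
    text = path.decode("utf-8")   # the typical case
except UnicodeDecodeError:
    text = None                   # bytes in some other encoding, or none at all
```

Here the quoted form is plain ASCII, while the unquoted path contains a 0xFF byte that is not valid UTF-8, so the decode attempt fails and the application has to decide what to do with the raw bytes.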

Generally I'm unaware of any reasonable situation where quoting unsafe
characters in an HTTP request would be improper, or even lose any meaningful
information. Mostly because I don't know of any clients that actually would
expect unsafe characters to work. Quoting HTTP_HOST is difficult, as it's
not a byte-oriented quoting, it's a fairly complex encoding. But I'm also
not sure where in a stack you could actually handle unsafe characters in
HTTP_HOST -- it seems like simply an invalid request, and deferring the
error won't give another part of the stack the opportunity to do the right
thing.
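
To show why HTTP_HOST differs from byte-level quoting, here is a small sketch that splits the port off and then decodes the hostname with Python's built-in "idna" codec. The function names are hypothetical, and the stdlib codec implements the older IDNA 2003 rules, so treat this as an approximation rather than a complete treatment.

```python
def split_host_port(http_host: str):
    """Split 'host:port' into (host, port); port is None when absent."""
    host, sep, port = http_host.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return http_host, None


def host_to_unicode(host: str) -> str:
    """Decode an ASCII (punycode) hostname; raises on non-ASCII input."""
    return host.encode("ascii").decode("idna")


host, port = split_host_port("xn--bcher-kva.example:8080")
display_name = host_to_unicode(host)   # the form a UI would show
```

A Host value containing non-ASCII characters makes host.encode("ascii") raise UnicodeEncodeError, which matches the view that such a request is simply invalid and can be rejected at this point.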

In their quoted form all these values (at least including the quoted path,
not the unquoted SCRIPT_NAME/PATH_INFO) *should* be ASCII, and I believe a
WSGI server could ensure they were all ASCII without any loss of useful
information (either by simply rejecting the request or by applying
quoting). I don't see any place where bytes are advantageous. Representing
invalid requests does not seem particularly helpful -- *some* invalid
requests are useful to handle (e.g., weird cookies) but in the case of the
URL variables I don't see any benefit.
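
The "applying quoting" option could look something like the following sketch, in which a server percent-quotes any byte of the raw request target that is not already URL-safe ASCII. The function name and the exact set of characters left untouched are assumptions made for the example.

```python
from urllib.parse import quote

# Characters left alone: common URL delimiters, plus '%' so that existing
# escapes survive. This exact set is an assumption for the sketch.
SAFE = "/?&=%+;:@,"


def ascii_only_target(raw_target: bytes) -> str:
    """Percent-quote every byte that is not already URL-safe ASCII."""
    return quote(raw_target, safe=SAFE)


target = ascii_only_target(b"/caf\xc3\xa9?q=a%20b")
target.encode("ascii")   # does not raise: the result is pure ASCII
```

Because the quoting works byte by byte, unquoting the result gives back the bytes the client sent.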

IMHO all the tricky encoding issues are in the request and response bodies,
and I'm pretty sure we have consensus that those should be bytes.

Reiterating other encoding issues I'm aware of:

Cookie encodings, but parsing cookies as bytes or Latin1 is basically
equivalent, and I don't believe that, for instance, they should ever be
parsed as UTF-8. Parsing as bytes might avoid an unnecessary
encoding/decoding, but it's all tricky enough that libraries should do it
anyway, and the encoding overhead alone isn't very important.
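
The equivalence between bytes and Latin1 for cookies comes from Latin-1 mapping each byte to exactly one code point, so the decoding is lossless. A deliberately simplistic sketch follows; the parser is hypothetical and ignores quoting and other cookie rules that a real library would handle.

```python
def parse_cookie_header(raw: bytes) -> dict:
    """Naive 'name=value; name=value' parser over a Latin-1 decoded header."""
    text = raw.decode("latin-1")          # never fails: every byte maps to a char
    pairs = {}
    for item in text.split(";"):
        name, sep, value = item.strip().partition("=")
        if sep:
            pairs[name] = value
    return pairs


raw_header = b"session=abc123; name=caf\xe9"
cookies = parse_cookie_header(raw_header)
original = cookies["name"].encode("latin-1")   # the exact bytes the client sent
```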

Another example is the Atom Title header (
http://bitworking.org/projects/atom/draft-ietf-atompub-protocol-08.html#rfc.section.8.1.2)
but that's supposed to be Latin1 with RFC2047 encodings, and I don't believe
anyone is proposing that RFC2047 encodings be handled generally at the WSGI
layer (I think CherryPy does or used to handle these, but there were many
objections at least on this list about it, in part due to security
concerns). A 2047 encoding is like "Title:
=?utf-8?q?stuff-with=-escaping?=".

Response headers are equivalent to request headers. Response status is
constrained by the spec to Latin1, and there are no use cases I know of
(even really obscure ones) where it would be necessary to use other
encodings.

And that's it! HTTP has a fairly finite amount of surface area.

--
Ian Bicking | http://blog.ianbicking.org