Hi,

I just want to reply to this because I think many people seem to be missing why things are done in a certain way, especially when they appear to be odd.

On 05/01/2016 12:26, Cory Benfield wrote:
1. WSGI is prone to header injection vulnerabilities by design
due to the conversion of HTTP headers to CGI-style environment
variables: if the server doesn’t specifically prevent it, X-Foo and
X_Foo both become HTTP_X_Foo. I don’t believe it’s a good choice to
destructively encode headers, expect applications to undo the damage
somehow, and introduce security vulnerabilities in the process. If
mimicking CGI is still considered a must-have — 1% of current Python web
programmers may have heard about it, most of them from PEP 3333 — then
that burden should be pushed onto the server, not the application.
Headers will always have to be encoded destructively if you want any form of generic processing. We need header joining, and we need to normalize the keys at least to the extent the HTTP specification requires. I'm happy not to perform the conversion of dashes to underscores, but you will work in environments where this conversion was already done, so the spec will need to deal with that case anyway.
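To make the collision concrete, here is a minimal sketch of the CGI-style key conversion (keys come out fully upper-cased, as in PEP 3333). The helper names and the `reject_underscores` flag are made up for this example; dropping header names that contain underscores is just one guard a server could apply, not something the current spec mandates.

```python
def cgi_key(name):
    """Convert an HTTP header name to a CGI-style environ key."""
    return "HTTP_" + name.upper().replace("-", "_")


def headers_to_environ(headers, reject_underscores=True):
    """Build environ entries from (name, value) pairs.

    reject_underscores is a hypothetical server-side guard: names that
    already contain an underscore are dropped, so they cannot shadow the
    dashed spelling after conversion.
    """
    environ = {}
    for name, value in headers:
        if reject_underscores and "_" in name:
            continue
        environ[cgi_key(name)] = value
    return environ


headers = [("X-Foo", "trusted"), ("X_Foo", "spoofed")]
guarded = headers_to_environ(headers)
unguarded = headers_to_environ(headers, reject_underscores=False)
```

Without the guard the later header silently overwrites the earlier one, and the application cannot tell which spelling was on the wire.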

The WSGI spec currently also does not sufficiently explain how to join headers. In particular, the cookie header was written without header joining in mind, which is why it needs to be joined differently from all other headers. Header joining also comes up as a big topic in HTTP/2, so the spec will need to cover this.
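A rough sketch of what joining could look like, assuming the usual rules: repeated request headers are combined with ", ", while repeated Cookie headers are combined with "; ". The function name is made up for illustration and this is not taken from any existing server.

```python
def join_headers(pairs):
    """Join repeated request headers into one value per lower-cased name."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name.lower(), []).append(value)
    return {
        # Cookie pairs are separated by "; "; everything else by ", ".
        name: ("; " if name == "cookie" else ", ").join(values)
        for name, values in grouped.items()
    }


joined = join_headers([
    ("Accept", "text/html"),
    ("accept", "application/json"),
    ("Cookie", "a=1"),
    ("cookie", "b=2"),
])
```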

2. More generally, I fail to see how mixing HTTP headers,
server-related inputs, and environment variables in a dict adds
values. It prevents iterating on each collection separately. It only
makes sense if not offering more features than CGI is a design goal;
in that case, this discussion doesn’t serve a purpose anyway. It
would be nicer and possibly more secure if the application received
separately:
I think this is largely a nice-to-have, not something that has any overall benefit. I would rather just clean up the actually stupid things such as CONTENT_TYPE and CONTENT_LENGTH, which cause a lot more real-world friction than the names of keys in general. This really should not turn into meaningless bikeshedding about what each piece of information should be called. Also consider how much code out there already assumes CGI/WSGI variables: any move off that really should have good reasons, or we will all just waste enormous amounts of time transposing between the two representations.
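The friction with CONTENT_TYPE and CONTENT_LENGTH is that they are the only request headers stored without the HTTP_ prefix, so any code that reconstructs headers from the environ has to special-case them. A minimal sketch (the function name is made up, and the title-casing of names is cosmetic):

```python
def headers_from_environ(environ):
    """Recover request headers from a WSGI environ dict."""
    headers = {}
    for key, value in environ.items():
        if key.startswith("HTTP_"):
            name = key[5:]
        elif key in ("CONTENT_TYPE", "CONTENT_LENGTH"):
            # The two CGI leftovers that carry no HTTP_ prefix.
            name = key
        else:
            continue
        headers[name.replace("_", "-").title()] = value
    return headers


environ = {
    "REQUEST_METHOD": "POST",
    "HTTP_HOST": "example.com",
    "CONTENT_TYPE": "text/plain",
    "CONTENT_LENGTH": "5",
    "wsgi.version": (1, 0),
}
recovered = headers_from_environ(environ)
```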

a. Configuration information, which servers could read from
environment variables by default for backwards compatibility, but could
also get through more secure channels and restrict to what the
application needs in order to better isolate it from the entire OS.
What WSGI traditionally lacked was a setup phase in which data that is server specific but not request bound could be passed to the application. For instance, there is no reason an application cannot get hold of wsgi.errors before a request comes in. I would like to see this fixed in a new specification.
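Purely as an illustration of what such a setup phase might look like: nothing below exists in WSGI today, and `make_app` and the `server_context` dict are hypothetical names invented for this sketch.

```python
import io


def make_app(server_context):
    """Hypothetical setup hook, called once by the server before any request."""
    errors = server_context["errors"]  # same kind of stream as wsgi.errors
    errors.write("application starting\n")

    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]

    return app


# Stand-in for what a server would do at startup.
log = io.StringIO()
app = make_app({"errors": log})
```

The request-time callable stays an ordinary WSGI application; only the way it is obtained changes.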

3. Stop pretending that HTTP is a unicode protocol, or at least stop
ignoring reality when doing so. WSGI enforces ISO-8859-1-decoded str
objects in the environ, which is just wrong. It’s all the more a
surprising choice since this change was driven by Python 3, that UTF-8
is the correct choice, and that Python 3 defaults to UTF-8. Django has
to re-encode and re-decode before doing anything with HTTP headers:
I agree with this, but you will have to have that fight with others. I have said many times before that values should never have been unicode values in the first place, but certain decisions in the Python 3 standard library at the time prevented that. In particular, until 3.2 or so it was impossible to parse byte URLs.
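For anyone who has not run into it, this is the round trip applications currently have to perform. A small sketch, assuming the client actually sent UTF-8; the helper name is made up.

```python
def wsgi_str_to_text(value, encoding="utf-8"):
    """Undo the ISO-8859-1 decoding WSGI applies to environ strings."""
    return value.encode("iso-8859-1").decode(encoding, errors="replace")


raw = "/caf\u00e9".encode("utf-8")         # bytes as sent on the wire
environ_value = raw.decode("iso-8859-1")   # what lands in the environ
text = wsgi_str_to_text(environ_value)     # what the application wanted
```

The intermediate `environ_value` is mojibake: every byte survives as one code point, which is exactly why the re-encode step is lossless.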

5. Improve request / response length handling and connection closure.
Armin and Graham have talked about it in the past and know the topic
better than I do. There’s also a rejected PEP by Armin which made
sense to me.
I think last time I discussed that with Graham it was not clear what the solution is in the context of WSGI. The idea that there is a content-length is laughable in the context of a real application where the server is performing conversions on the input and output stream. We would need many more than just one content length and an automatically terminated input stream.

However, at that point you will quickly realize that you can't have it both ways: you either have a WSGI-like protocol or raw access to sockets, but certainly not both. This topic has caused a lot of bikeshedding in the past and I fail to see how it will be different this time.

My current thinking is that the most realistic approach to most of those problems will be the concept of framing on both the input and output side. That's somewhat compatible with both chunked transports as well as websockets. But if we do go down this road, we will most likely have to standardize on a library that implements WSGI, as the complexity of dealing with this sort of stuff is significantly higher than what we had to do in the past.
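To show what I mean by framing, here is a minimal sketch that uses HTTP/1.1 chunked transfer coding as the frame format. This is only one possible framing, the function names are made up, and chunk extensions and trailers are ignored.

```python
def encode_chunked(chunks):
    """Frame an iterable of byte strings as HTTP/1.1 chunks."""
    for chunk in chunks:
        if chunk:  # a zero-length chunk would terminate the stream early
            yield b"%x\r\n%s\r\n" % (len(chunk), chunk)
    yield b"0\r\n\r\n"  # terminating frame


def decode_chunked(data):
    """Yield frame payloads from a complete chunked body."""
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol], 16)
        if size == 0:
            return  # the stream terminates itself, no content length needed
        start = eol + 2
        yield data[start:start + size]
        pos = start + size + 2  # skip the payload and its trailing CRLF


wire = b"".join(encode_chunked([b"hello", b"", b" world"]))
frames = list(decode_chunked(wire))
```

With frames the input stream ends on its own, so the application never has to trust a single content length that the server may already have invalidated.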


Regards,
Armin
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig