> On 6 Jan 2016, at 12:09 AM, chris.d...@gmail.com wrote:
>
> As someone who writes their WSGI applications as functions that take
> `start_response` and `environ` and doesn't bother with much
> framework the things I would like to see in a minor revision to WSGI
> are:
>
> * A consistent way to access the raw un-decoded request URI. This is
> so I can reconstruct a realistic `PATH_INFO` that has not been
> subjected to destructive handling by the server (e.g. apache
> messing with `%2F`) before continuing on to a route dispatcher.

This is already available in some servers by way of the REQUEST_URI value.
This carries the undecoded request target from the first line of the HTTP
request and can be split apart to get the path.
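
As a rough sketch of that splitting, assuming the server populates REQUEST_URI with an origin-form target (path plus optional query string) and nothing fancier. The function name `raw_path_from_environ` is made up for illustration; REQUEST_URI is not part of the WSGI specification, so the key may be absent.

```python
def raw_path_from_environ(environ):
    """Return the undecoded path from REQUEST_URI, or None if it is not set.

    Assumes an origin-form request target such as '/a%2Fb//c?x=1'.
    Absolute-form targets ('http://host/path') are not handled here.
    """
    request_uri = environ.get("REQUEST_URI")
    if request_uri is None:
        return None
    # Drop the query string only. No percent-decoding is done, so an
    # encoded slash (%2F) and repeated slashes survive untouched.
    path, _, _query = request_uri.partition("?")
    return path
```

A plain `partition` is used rather than `urllib.parse.urlsplit`, since the latter would read a path beginning with `//` as a network location.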

The problem is that you cannot easily use it unless you want to replicate the
normalisations that the underlying server may do.

The key problem is working out where SCRIPT_NAME ends and PATH_INFO starts
within the original path given in REQUEST_URI.

Sure, if you only deal with a web application mounted at the root of the host
it is easier because SCRIPT_NAME would be empty, but when mounted at a sub URL
it gets trickier.

This is because a web server will eliminate things like repeating slashes in
the part of the path that may match the mount point (sub URL) for the web
application. The sub URL here could be dictated by what is defined in a
configuration file, or could instead be due to matching against a file system
path.

Further, the web server will eliminate attempts at relative directory traversal
using ‘..’ and ‘.’.

So an original path may be something like:

    /a/b//c/../d/.//e/../f/g/h

If the mount point was ‘/a/b/d’, then that is what gets passed through
SCRIPT_NAME, because the path collapses to ‘/a/b/d/f/g/h’ once normalised.
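
To make the collapsing concrete, here is a minimal sketch of the kind of normalisation described above: empty segments and ‘.’ are dropped, and ‘..’ removes the preceding segment. `normalise_path` is a hypothetical helper, not what any particular server runs, and real servers differ in details such as trailing slashes and encoded characters.

```python
def normalise_path(raw_path):
    """Collapse repeated slashes and resolve '.' and '..' segments.

    Simplification: a trailing slash is not preserved, and
    percent-encoded segments are compared as-is.
    """
    segments = []
    for segment in raw_path.split("/"):
        if segment in ("", "."):
            # Empty segments come from repeated slashes.
            continue
        if segment == "..":
            # Step back one level, but never above the root.
            if segments:
                segments.pop()
            continue
        segments.append(segment)
    return "/" + "/".join(segments)


raw = "/a/b//c/../d/.//e/../f/g/h"
print(normalise_path(raw))  # /a/b/d/f/g/h
```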

Now if you instead go to the raw path you would need to replicate all the
normalisations. Only then could you try to calculate where SCRIPT_NAME ended
and PATH_INFO started in the raw path, perhaps based on the length of
SCRIPT_NAME, the number of component parts, or the actual components in the
path.
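
One way to attempt that calculation, sketched under the assumption that the server's normalisation is just the slash and dot-segment collapsing described above: walk the raw path segment by segment, keep a normalised stack, and stop at the first raw prefix whose normalised components equal those of SCRIPT_NAME. `split_raw_path` is a hypothetical name, and the sketch does no percent-decoding.

```python
def split_raw_path(raw_path, script_name):
    """Return (raw_script_name, raw_path_info), or None if nothing matches.

    Picks the earliest raw prefix that normalises to script_name.
    """
    target = [s for s in script_name.split("/") if s]
    if not target:
        # Mounted at the root: the whole raw path is PATH_INFO.
        return "", raw_path

    stack = []
    consumed = 0  # length of the raw prefix processed so far
    for index, segment in enumerate(raw_path.split("/")):
        # Account for the '/' separator before every segment but the first.
        consumed += len(segment) + (1 if index > 0 else 0)
        if segment in ("", "."):
            continue
        if segment == "..":
            if stack:
                stack.pop()
        else:
            stack.append(segment)
        if stack == target:
            return raw_path[:consumed], raw_path[consumed:]
    return None


print(split_raw_path("/a/b//c/../d/.//e/../f/g/h", "/a/b/d"))
# ('/a/b//c/../d', '/.//e/../f/g/h')
```

Note that the choice of the earliest match is arbitrary: in this example the prefixes ending at ‘/.’ and at ‘/e/..’ also normalise to ‘/a/b/d’, so the raw path alone does not say where the split really is.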

This will still all fail if a web server does internal rewrites though, as the
final SCRIPT_NAME may not even match the raw path. At that point URL
reconstruction can be a problem as well, if what the application is given by
way of the rewrite isn’t a public path.

I have only looked at SCRIPT_NAME here. Servers will apply the same sort of
normalisations to PATH_INFO as well.

So even this isn’t so simple to do properly if you want to go back and do it
yourself using the raw path.

I have never seen anyone who tries to extract repeating slashes intact out of
a raw path even attempt to do it properly. They tend to assume that the raw
path is pure, with nothing in it that needs to be normalised, and that
rewrites aren’t occurring. As a result they assume that they can just strip a
number of characters off the raw path based on the length of the SCRIPT_NAME
passed through. This will be fragile though if the raw path isn’t pure.
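
The fragility is easy to show. This is a sketch of the length-based shortcut, with `naive_path_info` as a made-up name for it:

```python
def naive_path_info(raw_path, script_name):
    """Strip len(script_name) characters off the front of the raw path."""
    return raw_path[len(script_name):]


# Works when the mount point portion of the raw path is already clean.
print(naive_path_info("/a/b/d//f/g/h", "/a/b/d"))
# //f/g/h

# Breaks as soon as the mount point portion needed normalising.
print(naive_path_info("/a/b//c/../d/.//e/../f/g/h", "/a/b/d"))
# c/../d/.//e/../f/g/h
```

In the second case the six characters removed are ‘/a/b//’, which is not the mount point at all, so what is left is not a usable PATH_INFO.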

Graham