> On 6 Jan 2016, at 12:09 AM, chris.d...@gmail.com wrote:
> 
> As someone who writes their WSGI applications as functions that take
> `start_response` and `environ` and doesn't bother with much
> framework the things I would like to see in a minor revision to WSGI
> are:
> 
> * A consistent way to access the raw un-decoded request URI. This is
>  so I can reconstruct a realistic `PATH_INFO` that has not been
>  subjected to destructive handling by the server (e.g. apache
>  messing with `%2F`) before continuing on to a route dispatcher.

This is already available in some servers by way of the REQUEST_URI value.

This carries the original request target from the first line of the HTTP 
request and can be split apart to get the path.
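
As a rough sketch of that splitting, assuming the server supplies the value under the non-standard `REQUEST_URI` key and that it is in origin form (a path plus optional query string, not an absolute URL), something like the following would do. The function name `raw_path_from_environ` is made up for illustration.

```python
def raw_path_from_environ(environ):
    """Return the undecoded path from REQUEST_URI, or None if it is absent."""
    # REQUEST_URI is not part of the WSGI specification; only some servers set it.
    request_uri = environ.get("REQUEST_URI")
    if request_uri is None:
        return None
    # Drop the query string. The path is left percent-encoded, exactly as sent.
    path, _, _query = request_uri.partition("?")
    return path


environ = {"REQUEST_URI": "/a/b%2Fc//d?x=1"}
print(raw_path_from_environ(environ))  # /a/b%2Fc//d
```

Note that the `%2F` and the doubled slash both survive, which is the point of going to the raw value in the first place.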

The problem is that you cannot easily use it unless you want to replicate 
normalisations that the underlying server may do.

The key problem is working out where SCRIPT_NAME ends and PATH_INFO starts with 
the original path given in REQUEST_URI.

Sure, if you only deal with a web application mounted at the root of the host 
it is easier, because SCRIPT_NAME would be empty, but when mounted at a sub URL 
it gets trickier.

This is because a web server will eliminate things like repeated slashes in 
the part of the path that may match the mount point (sub URL) for the web 
application. The sub URL here could be dictated by what is defined in a 
configuration file, or could instead be due to matching against a file system 
path.

Further, the web server will eliminate attempts at relative directory traversal 
using ‘..’ and ‘.’.

So an original path may be something like:

    /a/b//c/../d/.//e/../f/g/h

After normalisation that path becomes ‘/a/b/d/f/g/h’. If the mount point was 
‘/a/b/d’, then that is what gets passed through SCRIPT_NAME.
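
To make the normalisation concrete, here is a minimal sketch that uses `posixpath.normpath` as a stand-in for what a server does. That is an assumption: real servers have their own rules (trailing slashes, percent-decoding, and so on), and the helper name `normalise` is only illustrative.

```python
import posixpath


def normalise(raw_path):
    # normpath collapses repeated slashes and resolves '.' and '..' segments.
    path = posixpath.normpath(raw_path)
    # normpath deliberately keeps a leading '//', so collapse that case too.
    if path.startswith("//"):
        path = "/" + path.lstrip("/")
    return path


raw_path = "/a/b//c/../d/.//e/../f/g/h"
script_name = "/a/b/d"

normalised = normalise(raw_path)
print(normalised)                     # /a/b/d/f/g/h
print(normalised[len(script_name):])  # /f/g/h
```

Splitting on the length of SCRIPT_NAME only works here because it is applied to the normalised path, not the raw one.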

Now if you instead go to the raw path, you would need to replicate all the 
normalisations. Only then could you try to calculate where SCRIPT_NAME ended 
and PATH_INFO started in the raw path, perhaps based on the length of 
SCRIPT_NAME, the number of component parts, or the actual components in the path.
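
One way to attempt that calculation, purely as a sketch: try each slash in the raw path as a candidate split point and take the shortest prefix that normalises to SCRIPT_NAME. This assumes `posixpath.normpath` approximates the server's normalisation and ignores percent-decoding entirely; `split_raw_path` and `_normalise` are hypothetical names.

```python
import posixpath


def _normalise(path):
    # Stand-in for the server's normalisation (an assumption, not a given).
    path = posixpath.normpath(path)
    if path.startswith("//"):
        path = "/" + path.lstrip("/")
    return path


def split_raw_path(raw_path, script_name):
    """Return (raw_script_name, raw_path_info), or None if no split matches."""
    target = script_name.rstrip("/")
    if not target:
        # Mounted at the root: everything is PATH_INFO.
        return "", raw_path
    # Candidate split points: every '/' in the raw path, plus the very end.
    cuts = [i for i, ch in enumerate(raw_path) if ch == "/"] + [len(raw_path)]
    for cut in cuts:
        prefix = raw_path[:cut]
        if prefix and _normalise(prefix) == target:
            return prefix, raw_path[cut:]
    return None


print(split_raw_path("/a/b//c/../d/.//e/../f/g/h", "/a/b/d"))
# ('/a/b//c/../d', '/.//e/../f/g/h')
```

Taking the shortest matching prefix is itself a guess, since more than one prefix can normalise to the same SCRIPT_NAME, and a `None` result is what you would expect whenever a rewrite has happened.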

This will still all fail if a web server does internal rewrites though, as the 
final SCRIPT_NAME may not even match the raw path. At that point URL 
reconstruction can be a problem as well, if what the application is given by 
way of the rewrite isn’t a public path.

I have only looked at SCRIPT_NAME here. Servers will apply the same sort of 
normalisations to PATH_INFO as well.

So even this isn’t so simple to do properly if you want to go back and do it 
yourself using the raw path.

I have never seen anyone who tries to extract repeated slashes intact out of a 
raw path even attempt to do it properly. They tend to assume that the raw path 
is pure, with nothing in it that needs to be normalised, and that rewrites 
aren’t occurring. As a result they assume that they can just strip a number of 
characters off the raw path based on the length of the SCRIPT_NAME passed 
through. This will be fragile though if the raw path isn’t pure.
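
A small illustration of that fragility, using the example path from earlier. The function names are made up, and the guarded variant is only a sketch of refusing to guess, not a fix.

```python
def naive_path_info(raw_path, script_name):
    # Fragile: assumes the raw path begins with SCRIPT_NAME verbatim.
    return raw_path[len(script_name):]


def guarded_path_info(raw_path, script_name):
    # Refuse to guess when the raw path does not literally start with SCRIPT_NAME.
    if not raw_path.startswith(script_name):
        return None
    return raw_path[len(script_name):]


script_name = "/a/b/d"

print(naive_path_info("/a/b/d/f/g/h", script_name))
# /f/g/h
print(naive_path_info("/a/b//c/../d/.//e/../f/g/h", script_name))
# c/../d/.//e/../f/g/h
print(guarded_path_info("/a/b//c/../d/.//e/../f/g/h", script_name))
# None
```

With a pure raw path the naive version happens to work; with the impure one it silently returns something that is not a usable PATH_INFO.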

Graham
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: 
https://mail.python.org/mailman/options/web-sig/archive%40mail-archive.com