[PATCH] HTTP: Add new uri_normalization_percent_decode option
# HG changeset patch # User Michael Kourlas # Date 1676408746 18000 # Tue Feb 14 16:05:46 2023 -0500 # Node ID 129437ade41b14a584fb4b7558accc1b8dee7f45 # Parent cffaf3f2eec8fd33605c2a37814f5ffc30371989 HTTP: Add new uri_normalization_percent_decode option This patch addresses ticket #2225 by adding a new uri_normalization_percent_decode configuration option that controls which characters are percent-decoded by nginx as part of its URI normalization. The option has two values: "all" and "all-except-reserved". "all" is the default value and is the current behaviour. When the option is set to "all-except-reserved", nginx percent-decodes all characters except those in the reserved set defined by RFC 3986: reserved= gen-delims / sub-delims gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" In addition, when "all-except-reserved" is used, nginx will not re-encode "%" from the request URI when it observes that it is part of a percent-encoded reserved character. When nginx percent-decodes reserved characters, this can often change the request URI's semantics, making it impossible to use a normalized URI for certain use cases. "uri_normalization_percent_decode" gives the configuration author the freedom to determine which reserved characters are semantically relevant and which are not. For example, consider the following location block, which handles part of a hypothetical API: location ~ ^/api/objects/[^/]+/subobjects(/.*)?$ { ... } Because nginx always normalizes "%2F" to "/", this location block will not match a path of /api/objects/sample%2Fname/subobjects, even if the API permits "/" to appear percent-encoded in the URI as part of object names. nginx will instead interpret this as /api/objects/sample/name/subobjects, a completely different path. Setting "uri_normalization_percent_decode" to "all-except-reserved" will leave "%2F" encoded, resulting in the expected behaviour. diff -r cffaf3f2eec8 -r 129437ade41b src/core/ngx_string.c --- a/src/core/ngx_string.c Thu Feb 02 23:38:48 2023 +0300 +++ b/src/core/ngx_string.c Tue Feb 14 16:05:46 2023 -0500 @@ -1487,7 +1487,8 @@ uintptr_t -ngx_escape_uri(u_char *dst, u_char *src, size_t size, ngx_uint_t type) +ngx_escape_uri(u_char *dst, u_char *src, size_t size, ngx_uint_t type, +ngx_uint_t skip_preencoded_type) { ngx_uint_t n; uint32_t *escape; @@ -1641,7 +1642,11 @@ n = 0; while (size) { -if (escape[*src >> 5] & (1U << (*src & 0x1f))) { +if ((escape[*src >> 5] & (1U << (*src & 0x1f))) +&& !(*src == '%' && size >= 3 + && ngx_escape_uri_skip_preencoded_character( + src + 1, skip_preencoded_type))) +{ n++; } src++; @@ -1652,7 +1657,11 @@ } while (size) { -if (escape[*src >> 5] & (1U << (*src & 0x1f))) { +if ((escape[*src >> 5] & (1U << (*src & 0x1f))) +&& !(*src == '%' && size >= 3 + && ngx_escape_uri_skip_preencoded_character( + src + 1, skip_preencoded_type))) +{ *dst++ = '%'; *dst++ = hex[*src >> 4]; *dst++ = hex[*src & 0xf]; @@ -1668,6 +1677,87 @@ } +ngx_uint_t +ngx_escape_uri_skip_preencoded_character(u_char *hex_component, +ngx_uint_t skip_preencoded_type) +{ +u_charch, decoded_ch; +uint32_t *skip; + +static uint32_t none[] = { +0x, /* */ + +/* ?>=< ;:98 7654 3210 /.-, +*)( '&%$ #"! */ +0x, /* */ + +/* _^]\ [ZYX WVUT SRQP ONML KJIH GFED CBA@ */ +0x, /* */ + +/* ~}| {zyx wvut srqp onml kjih gfed cba` */ +0x, /* */ + +0x, /* */ +0x, /* */ +0x, /* */ +0x /* */ +}; + +static uint32_t reserved_only[] = { +0x, /* */ + +/* ?>=< ;:98 7654 3210 /.-, +*)( '&%$ #"! */ +0xac009fda, /* 1010 1100 1001 1101 1010 */ + +/* _^]\ [ZYX WVUT SRQP ONML KJIH GFED CBA@ */ +0x2801, /* 0010 1000 0001 */ + +/* ~}| {zyx wvut srqp onml kjih gfed cba` */ +0x, /* */ + +0x, /* 0
Re: [PATCH] HTTP: Add new uri_normalization_percent_decode option
Hello, Thanks for your comments! Sorry for the delay in responding -- I was on vacation. > As far as I understand, it will irreversibly corrupt URIs with > double-encoded reserved characters. For example, "%252F" will > become "%2F" when proxying in the following configuration: > > location /foo/ { > proxy_pass http://upstream/foo/; > } I believe you are correct. When the "all-except-reserved" option is used, nginx will not recognize, during the re-encoding step, that the "%" in "%2F" was originally percent-encoded and not part of a percent-encoded "/". However, I think this issue can be addressed by renaming "all-except-reserved" to "all-except-percent-and-reserved" and adding "%" to the corresponding set of characters that is not decoded or encoded automatically when that option is used. "%252F" will then be left untouched throughout the entire process. > Further, requests to static files with (properly escaped) reserved > characters will simply fail, because nginx won't decode these > characters. For example, in the following trivial configuration a > request to "/foo%3Fbar" won't be decoded to match "/foo?bar" file > under the document root: > > location / { > # static files > } Yes, I agree it doesn't make sense to keep "%" and reserved characters percent-encoded when performing a filesystem lookup, because at that point we can be certain they're not being used as delimiters or anything other than data. However, I think this issue can be addressed by decoding any remaining percent-encoded characters at the point of filesystem lookup, as well as any other place where you think it is appropriate. (Of course, this should only be done when using the "all-except-percent-and-reserved" option, and only with "%" and reserved characters, to avoid double-decoding double-encoded characters.) > Please also note that the configuration directive you've > introduced in this patch applies to URI parsing from not-yet-final > server block (see [1] for details), but the configuration from the > final server block will be used for URI escaping. These > configuration can be different, and this might result in various > additional issues. I think this issue can be addressed by making the new directive available only in http blocks. This ensures that all virtual servers share the same configuration. If you agree with these suggestions, I'd be happy to submit an updated patch. Reserved characters are often used as delimiters in APIs, and I believe it is important that nginx be able to distinguish between usages as delimiters and usages as data. Thanks, Michael Kourlas Confidentiality notice This e-mail message and any attachment hereto contain confidential information which may be privileged and which is intended for the exclusive use of its addressee(s). If you receive this message in error, please inform sender immediately and destroy any copy thereof. Furthermore, any disclosure, distribution or copying of this message and/or any attachment hereto without the consent of the sender is strictly prohibited. Thank you. ___ nginx-devel mailing list nginx-devel@nginx.org https://mailman.nginx.org/mailman/listinfo/nginx-devel
RE: [PATCH] HTTP: Add new uri_normalization_percent_decode option
Hello, Thanks again for your comments. > This implies, basically, that there are 3 forms of the request > URI: 1) fully encoded, as in $request_uri, 2) fully decoded, as in > $uri now, and 3) "all-except-percent-and-reserved". To implement this > correctly, it needs clear definition when each form is used, and > it is going to be a non-trivial task to do this safely. I agree. A simple way to do this would be to make percent-decoding customizable on a per-directive basis. The core use case I was hoping to support is preserving encoded reserved characters in location matching (basically what was proposed in [1]), so that is what I would like to focus on in a reworked version of this patch. I propose the following: (1) The addition of a new variable called $uri_encoded_percent_and_reserved. As discussed, this variable is a special version of the normalized URI ($uri) that preserves any percent-encoded "%" or reserved characters. (2) Every transformation applied to $uri (e.g. from the "rewrite" directive, internal redirects, etc.) is automatically applied to $uri_encoded_percent_and_reserved as well. If this raises performance concerns, a new flag could be added to enable or disable the availability of $uri_encoded_percent_and_reserved. (3) The addition of a new optional parameter to the URI form of "location" blocks called "match-source": location [ = | ~ | ~* | ^~ ] uri [match-source=uri|uri-encoded-percent-and-reserved] { ... } For example: location ~ ^/api/objects/[^/]+/subobjects(/.*)?$ match-source=uri-encoded-percent-and-reserved { ... } "match-source=uri" is the default and the current behaviour. When "uri-encoded-percent-and-reserved" is used, the location matching for that block uses $uri_encoded_percent_and_reserved rather than $uri. Nested location blocks are not affected (unless they also use "uri-encoded-percent-and-reserved"). In future it would be possible to use a similar pattern with other directives that use $uri, such as "proxy_pass", but that can be done as part of a separate patch. If you think this is a sensible approach, I will submit a revised patch implementing it. Thanks, Michael Kourlas [1] https://trac.nginx.org/nginx/ticket/2225 Confidentiality notice This e-mail message and any attachment hereto contain confidential information which may be privileged and which is intended for the exclusive use of its addressee(s). If you receive this message in error, please inform sender immediately and destroy any copy thereof. Furthermore, any disclosure, distribution or copying of this message and/or any attachment hereto without the consent of the sender is strictly prohibited. Thank you. ___ nginx-devel mailing list nginx-devel@nginx.org https://mailman.nginx.org/mailman/listinfo/nginx-devel
RE: [PATCH] HTTP: Add new uri_normalization_percent_decode option
Hello, Thanks again for your feedback. > Consider the following rewrite: > > rewrite ^/(.*) /$1 break; > > Assuming request to "GET /foo%2fbar/", what > $uri_encoded_percent_and_reserved do you expect after each of > these rewrites? I do not think that rewrite does anything in practice. Following the rewrite, I would expect $uri to remain unchanged at its current value of "/foo/bar/" and $uri_encoded_percent_and_reserved to similarly remain unchanged at its current value of "/foo%2fbar/". > Similarly, consider the following rewrite: > > rewrite ^/foo/(.*) /$1 break; > > What $uri_encoded_percent_and_reserved is expected after the > rewrite? In this case the regular expression matches $uri but not $uri_encoded_percent_and_reserved. One could say that this just means that only $uri is updated, but that has the potential to cause confusion when a flag is used that changes the control flow (unless the user explicitly opts into this behaviour). This could be addressed by adding a "match-source" optional argument to "rewrite" with three values (and a default of "uri"): * "uri" - rewrite directive matches and changes $uri only * "uri_encoded_percent_and_reserved" - rewrite directive matches and changes $uri_encoded_percent_and_reserved only * "all" - rewrite directive matches and changes both (if only one is matched, directive is not applied) It might also be a good idea to add "uri_encoded_percent_and_reserved_regex" and "uri_encoded_percent_and_reserved_replacement" arguments to be used with "all", so that it is possible to use the same directive and flag even when needing to perform slightly different rewrites for $uri versus $uri_encoded_percent_and_reserved. > Consider the following configuration: > > location /foo%2fbar/ match-source=uri-encoded-percent-and-reserved { > ... > } > > location /foo/bar/ match-source=uri { > ... > } > > The question is: which location is expected to be matched for the > request "GET /foo%2fbar/"? Both blocks match their respective variables. Since the first block has the longest matching prefix, I expect it will be selected. > Which location is expected to be matched for the request "GET > /foo%2Fbar/" (note that it is exactly equivalent to "GET > /foo%2fbar/"). Only the second block matches its variable, so I expect it will be selected. Although paths are generally case sensitive, a percent-encoded character is not supposed to be, so this behaviour is unfortunate. One possibility is to automatically use case-insensitive matching for any part of a location prefix using "match-source=uri-encoded-percent-and-reserved" that is a percent encoded "%" or reserved character. Alternatively this behaviour could be documented with an instruction to use regular expressions instead. > Assuming static handling in the locations, what happens with the > request "GET /foo%2fbar/..%2fbazz"? The first block would be used. However, $uri would be used for static handling, so the path "/foo/bazz" would be looked up on the filesystem, not "/foo%2fbar/..%2fbazz". > Note that the behaviour does not seem to be obvious, and it is an > open question if it can be clarified to be safe. Fair enough. I am certainly happy to continue making changes to my proposal to address the specific concerns you raise. However, are you saying that you have broader overall concerns about safety, complexity, etc. that make a patch implementing the proposal unlikely to be accepted, even if all specific concerns are addressed? Thanks, Michael Kourlas Confidentiality notice This e-mail message and any attachment hereto contain confidential information which may be privileged and which is intended for the exclusive use of its addressee(s). If you receive this message in error, please inform sender immediately and destroy any copy thereof. Furthermore, any disclosure, distribution or copying of this message and/or any attachment hereto without the consent of the sender is strictly prohibited. Thank you. ___ nginx-devel mailing list nginx-devel@nginx.org https://mailman.nginx.org/mailman/listinfo/nginx-devel
RE: [PATCH] HTTP: Add new uri_normalization_percent_decode option
Hello, Thanks for your lengthy explanation -- it's much appreciated. I'll find a way to support my use case in the upstream server instead. Best, Michael Kourlas Confidentiality notice This e-mail message and any attachment hereto contain confidential information which may be privileged and which is intended for the exclusive use of its addressee(s). If you receive this message in error, please inform sender immediately and destroy any copy thereof. Furthermore, any disclosure, distribution or copying of this message and/or any attachment hereto without the consent of the sender is strictly prohibited. Thank you. ___ nginx-devel mailing list nginx-devel@nginx.org https://mailman.nginx.org/mailman/listinfo/nginx-devel