On Fri, 2020-03-13 at 16:02 +0100, Michael Osipov wrote:
> Am 2020-03-13 um 15:35 schrieb Mark Thomas:
> > Hi all,
> > 
> > I am writing this up as this is a change I'd like to make in Tomcat
> > 10
> > that I think is important to get right. It may also get back-
> > ported.
> > 
> > This first arose in this mod_jk bug:
> > https://bz.apache.org/bugzilla/show_bug.cgi?id=62459
> > 
> > Ignoring the mod_jk aspects for now (they will come later) the bug
> > report raises the important question of how to handle the case
> > where the
> > ID for a resource in a RESTful API includes a "/".
> > 
> > At the moment, Tomcat does not handle this correctly. If
> > ALLOW_ENCODED_SLASH is false, the request is rejected. If it is
> > true,
> > the wrong resource identifier will be used. This is an edge case,
> > but
> > one I'd like to fix.
> > 
> > My research led me back to RFC 3986. Quoting from section 2.2:
> > 
> > <quote>
> > The purpose of reserved characters is to provide a set of
> > delimiting
> > characters that are distinguishable from other data within a URI.
> > URIs that differ in the replacement of a reserved character with
> > its
> > corresponding percent-encoded octet are not equivalent.  Percent-
> > encoding a reserved character, or decoding a percent-encoded octet
> > that corresponds to a reserved character, will change how the URI
> > is
> > interpreted by most applications.  Thus, characters in the reserved
> > set are protected from normalization and are therefore safe to be
> > used by scheme-specific and producer-specific algorithms for
> > delimiting data subcomponents within a URI.
> > </quote>
> > 
> > My reading of this is that there are some %nn sequences that we
> > should
> > *never* decode. The values we pass to applications for ServletPath,
> > PathInfo etc. should still include these %nn sequences and the
> > application should decode them.
> > 
> > My next thought was "Which %nn sequences should we leave alone?".
> > That
> > got me thinking about URIEncoding values and how to differentiate
> > between a %nn sequence we wanted to leave alone and the same
> > sequence
> > appearing where a code point is represented by multiple bytes.
> > Fortunately, RFC7230 saves us from that complication as it requires
> > all
> > encodings to be supersets of US-ASCII. Or to put it another way,
> > whenever %nn appears with nn in the range 00 to 7F, that %nn
> > sequence will *always* represent the equivalent US-ASCII code
> > point.
> > 
> > So, that simplifies things a little as we go back to considering
> > which
> > %nn sequences we have to leave alone.
> > 
> > The starting point is "reserved" characters. From RFC 3986:
> > 
> > reserved    = gen-delims / sub-delims
> > 
> > gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
> > 
> > sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
> >              / "*" / "+" / "," / ";" / "="
> > 
> > We are talking about the URI in Tomcat which, at the point we %nn
> > decode, is just the path. The path parameters and query string have
> > been removed.
> > 
> > From RFC 7230:
> > 
> > absolute-path = 1*( "/" segment )
> > 
> > and from RFC 3986:
> > 
> > segment       = *pchar
> > 
> > pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
> > 
> > 
> > So the question is, which reserved characters cannot be safely
> > decoded
> > from their %nn form.
> > 
> > We know all sub-delims are safe to decode because:
> > - they are valid characters in a segment and, with the query string
> >   and path parameters removed, none of those characters have special
> >   meaning
> > 
> > That leaves gen-delims
> > 
> > Of those ":" and "@" are explicitly allowed in a segment. So that
> > leaves:
> > 
> > "/" "?" "#" "[" "]"
> > 
> > "?" is the query delimiter but the query string has been removed so
> > it
> > is safe to %nn decode to "?".
> > 
> > "#" is the fragment delimiter. The fragment will never reach the
> > server
> > so it is safe to %nn decode to "#".
> > 
> > "[" and "]" are delimiters in the host but not the path so they are
> > safe.
> > 
> > That just leaves "/".
> > 
> > My proposal is, therefore, actually very simple:
> > 
> > 1. Remove the UDecoder.ALLOW_ENCODED_SLASH option.
> > 2. Replace it with a per Connector setting that has three options:
> >     a) deny (equivalent to ALLOW_ENCODED_SLASH="false")
> >     b) decode (equivalent to ALLOW_ENCODED_SLASH="true")
> >     c) allow (leaves as is)
> 
> I am CC'ing our expert olegk@ on this topic because at HttpComponents
> we 
> had numerous JIRA issues regarding the handling and RFC 3986 
> interpretation. It is, sadly, a constant source of trouble.
> 
> Oleg, can you share your view on Mark's proposal?
> 
> Michael
> 

Michael et al

I am not really qualified to comment on the proposal, but as far as I
can see it makes sense.

Client side libraries however have an extra problem to contend with. 

Most of the time HttpClient just passes the request URI to the server
as is, exactly as specified by the user.

There is a catch though. Absolute request URIs need to be parsed and
split into their respective authority and path/query components.

http://host:8080/stuff/123 

---
GET /stuff/123 HTTP/1.1
Host: host:8080
---
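
To illustrate the split, here is a minimal Java sketch built on
java.net.URI. It is not HttpClient's actual code; the class and method
names are hypothetical. It uses the raw accessors so that %nn
sequences such as %2F are passed through untouched.

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
final class RequestTargetSplitter {

    /** Returns { Host header value, request target } for an absolute URI. */
    static String[] split(String absoluteUri) {
        URI uri = URI.create(absoluteUri);
        // Raw accessors do not percent-decode anything.
        // Note: the raw authority would also include user info if present.
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // an empty path is sent as "/"
        }
        String query = uri.getRawQuery();
        String target = (query == null) ? path : path + "?" + query;
        return new String[] { authority, target };
    }
}
```

For http://host:8080/stuff/123 this yields "host:8080" and
"/stuff/123", matching the request above.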

However, there is a fringe case that can produce ambiguous or illegal
request messages.

What is one supposed to do with absolute request URIs like this one?

http://host:8080//stuff///123 

There does not appear to be any statement in RFC 7230 as to what the
expected behavior should be.

What we presently do in both the 4.x and 5.x release lines is normalize
such request URIs by collapsing multiple consecutive forward slashes in
the path component into a single one.

By default HttpClient would generate this request as 
---
GET /stuff/123 HTTP/1.1
Host: host:8080
---

and not as 
---
GET //stuff///123 HTTP/1.1
Host: host:8080
---
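
The collapsing step can be sketched roughly as follows. This is an
illustration of the idea only, not the code HttpClient ships; the
class and method names are hypothetical. It works on the raw path, so
percent-encoded slashes (%2F) are left alone.

```java
// Hypothetical helper, for illustration only.
final class PathNormalizer {

    /** Collapses runs of literal '/' in a raw path into a single '/'. */
    static String collapseSlashes(String rawPath) {
        StringBuilder out = new StringBuilder(rawPath.length());
        boolean previousWasSlash = false;
        for (int i = 0; i < rawPath.length(); i++) {
            char c = rawPath.charAt(i);
            if (c == '/' && previousWasSlash) {
                continue; // drop the repeated slash
            }
            out.append(c);
            previousWasSlash = (c == '/');
        }
        return out.toString();
    }
}
```

With this sketch, collapseSlashes("//stuff///123") returns
"/stuff/123", while "/a%2F%2Fb" comes back unchanged.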

That should not be relevant as far as this proposal is concerned, but I
thought I should mention it just in case, as the handling of consecutive
forward slashes in the path component of request URIs has caused us a
lot of grief in the past.

Oleg

