Re: URL-encoding and "#"

Mark Thomas Fri, 13 Oct 2017 11:29:16 -0700

On 13/10/17 18:42, André Warnier (tomcat) wrote:
> On 13.10.2017 19:29, Mark Thomas wrote:
>> On 13/10/2017 18:15, André Warnier (tomcat) wrote:
>>> On 13.10.2017 18:17, Mark Thomas wrote:
>>>> On 13/10/2017 17:09, James H. H. Lampert wrote:
>>>>> Thanks to all of you who responded.
>>>>>
>>>>> I found a web page that explains it in ways that I can wrap my
>>>>> 55-year-old brain around, and has an easy-to-read reference chart.
>>>>>
>>>>> https://perishablepress.com/stop-using-unsafe-characters-in-urls/
>>>>>
>>>>> Question: the problem first showed up on a web service that takes a
>>>>> "bodyless" POST operation, and I assume it also applies to GET
>>>>> operations, and to the URL portion of a POST with a body.
>>>>>
>>>>> But what about the body of a POST?
>>>>
>>>>   From an HTTP specification point of view, anything goes.
>>>
>>> With respect, I believe that "anything goes" is a bit imprecise here.
>>
>> Nope.
>>
>> You can POST anything. You are talking specifically about form data.
> 
> Mmm. You are being a bit casuistic here. (Granted, not that I wasn't.)


Yeah, sorry about that. I tend to read "With respect..." as meaning
pretty much exactly the opposite.

> In the real world, I would expect that 99% of what is ever POSTed, /is/
> form data.
> Not you ?

For Tomcat I don't have a clue what the split is, but my guess is that
it is a lot less than 99% these days.

>  In
>> that case, as I said, the body has to conform to what the component
>> processing it expects.
> 
> And that component would be .. ?

https://svn.apache.org/viewvc/tomcat/trunk/java/org/apache/tomcat/util/http/Parameters.java?view=annotate

> I don't really know, but I would guess that in most webservers, the
> component parsing the body of a POST with Content-type =
> application/x-www-form-urlencoded, may be the same as the one which is
> parsing the query-string of a URI, no ?
> Considering the similarity of these two things, it would seem that the
> temptation would be hard to resist.

Tomcat uses exactly the same code - with a little wrapping to get the
data into the same format before it starts.
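
For anyone composing such a body by hand rather than letting a browser do
it, here is a minimal sketch of the encoding step discussed in the quoted
text below. This is not Tomcat's code: the class FormBodySketch and its
encode() method are made-up names for illustration, and it assumes Java 10
or later and a server that decodes the body as UTF-8. Each key and value is
percent-encoded on its own, and only then are the pairs joined with
unescaped "&" signs.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of Tomcat.
class FormBodySketch {

    // Builds an application/x-www-form-urlencoded body.
    // Assumption: the receiving side decodes the bytes as UTF-8.
    static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&"); // separator stays unescaped
        for (Map.Entry<String, String> e : params.entrySet()) {
            // Encode key and value separately, before concatenating.
            String key = URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8);
            String value = URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8);
            body.add(key + "=" + value);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "Andr\u00e9 #1"); // non-ASCII character, space and "#"
        params.put("q", "a&b=c");            // "&" and "=" inside a value
        // prints: name=Andr%C3%A9+%231&q=a%26b%3Dc
        System.out.println(encode(params));
    }
}
```

The same string could be appended after "?" as a query string, which fits
with one parser being able to handle both.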

Mark


>> And yes, unicode in form data is 'interesting'...
>>
>> Mark
>>
>>
>>> See e.g. https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
>>>
>>> There are 2 ways for a user agent to send the content of an HTTP POST :
>>> 1) with Content-type header = application/x-www-form-urlencoded
>>> or
>>> 2) with Content-type header = multipart/form-data
>>>
>>> and while it is true that in the case (2), any submitted key=value pair
>>> would be sent separately 'as is', this would not necessarily be so in
>>> case (1), because then all key=value pairs would be concatenated into
>>> one long string, in which the different key=value pairs would be
>>> separated by (unescaped) "&" signs.
>>> (Apart from other required encodings, see the page above)
>>> So if the client is not a browser, and "composes" the POST body itself
>>> before sending it, and sends it with a Content-type (1), it had better
>>> encode the individual parameter pairs as described, before concatenating
>>> them, because that is what the server would expect.
>>>
>>> As an additional note, if it so happened that the data in the client
>>> could contain Unicode text, do not forget that this is (still) not the
>>> standard in HTTP (and URIs, and thus query-string-like things), and
>>> make sure that you use the proper method to encode any printable
>>> characters which are not purely US-ASCII.  Again, browsers generally do
>>> this correctly, but custom clients not necessarily. (And a "custom
>>> client" in this case, could even be a bit of javascript which is
>>> embedded in one of your own pages, but does its own calls to the server
>>> on the side).
>>>
>>> I just recently got bitten by this, even in a quite recent browser,
>>> where some javascript function was composing a POST to a server (using
>>> type (1) above), and was NOT doing it correctly, even though the page
>>> containing and calling this function was itself declared as
>>> Unicode/UTF-8.
>>> (that was with (and I am too sorely tempted to add "of course" to resist
>>> it) some revision of IE-11 - although other revisions of the same
>>> browser did not exhibit that same issue).
>>>
>>> [...]
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
>>> For additional commands, e-mail: users-h...@tomcat.apache.org
>>>