On 18.11.2016 20:27, Christopher Schultz wrote:
Konstantin,
On 11/18/16 2:10 PM, Konstantin Kolinko wrote:
One more authority that I forgot to mention in my mail: the IANA
registry of MIME types
Registry:
https://www.iana.org/assignments/media-types/media-types.xhtml
Registration entry for "application/x-www-form-urlencoded"
https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded
-> Encoding considerations : 7bit
According to the RFC defining this registry, that means the data is
7-bit ASCII only. https://tools.ietf.org/html/rfc6838#section-4.8
Oh, that's the nail in the coffin.
application/x-www-form-urlencoded from W3C says "if the character
doesn't fit into the encoding of the message, it must be %-encoded"
but it never says what "the encoding of the message" actually is. My
worry was that it was mutable, and that UTF-8 was a valid encoding,
meaning that 0xc2 0xae on the wire would have been acceptable (rather
than %C2%AE).
If application/x-www-form-urlencoded is *absolutely* supposed to be
7-bit ASCII, then nothing above 0x7f can ever be legally transferred
across the wire when using that content-type.
This solves André's problem with this content-type where he wanted to
specify the charset to be used. It seems the standard defines the
character set: US-ASCII.
With respect, this is not only "André's problem".
This is a general problem (not only with Tomcat), which affects users, web application
programmers, and webserver developers alike, as soon as they are dealing with the World at
large, which uses a lot of languages that cannot be represented in the iso-latin-1
character set, and much less in the US-ASCII character set.
It affects users, because many users still regularly see the data that they enter into web
application pages and submit to a server being misinterpreted. (I cannot tell you how
many times, even nowadays, I fill in my name in a web form, only to have it echoed back to
me as some mangled variation of "André"..)
As for web application and webserver developers, one only has to look at the archives of a
forum such as Tomcat's to see how often and how regularly such issues come up, and keep
coming up, over the years:
Sample from marc.info, tomcat-user:
period: 2016-02-08 / 2016-11-19
Total messages: 3582
Messages mentioning "encoding": 164
Messages mentioning "character set": 41
for comparison:
Messages mentioning "NIO": 90
Messages mentioning "AJP": 201
Messages mentioning "memory": 258
Granted, this is not a very fine-grained analysis. But all in all, it would tend to
suggest that this is not a "minor" issue: for Tomcat alone, it comes up just about as
often as the "memory usage" topic, and more often than either Connector above.
I would also posit that, this being an English-language forum, the posters here would tend
to be predominantly English-speaking developers, who are quite likely not the ones most
affected by such issues. So the above numbers quite likely understate the number of people
really affected by such matters.
And one could also look at the amount of code, in applications and in Tomcat itself, which
is dedicated to working around related issues.
(Think "useBodyEncodingForURI",
"org.apache.catalina.filters.AddDefaultCharsetFilter", and the many variations of a
set-character-encoding filter, such as the one sketched below.)
Basically what I'm saying is that this "posted-parameters-encoding-issue" is far from
being "licked", despite the fact that native English-speaking developers may have a
tendency to believe that it is.
The only problem now is that it's not clear how to turn %C2%AE into a
character, because you have to know that UTF-8, and not Shift_JIS or
whatever, is being used.
-> Required parameters : No parameters
-> Optional parameters : No parameters
OK. So no charset= parameter is allowed. My advice to specify the
charset parameter was wrong.
No, it wasn't, not really. I believe that you were on a good track there.
It is the spec that is wrong, really.
One is allowed to question a spec if it appears wrong, no?
After all, RFC means "Request For Comments".
Agreed: it is always against the spec(s) to specify a charset for any
MIME type that is not text/*.
Agreed. It just makes no sense for data that is not fundamentally "text".
(Whether such text data has a MIME type whose name starts with "text/" is quite another
matter. For example, the MIME type "application/ecmascript" refers to text data
(javascript code), and allows a charset attribute, even though its type name does not
start with "text/"; there are many other types like that.)
Though historically ~10 years ago I saw
"application/x-www-form-urlencoded;charset=UTF-8" Content-Type in
the wild.
Oh, I'm sure you saw it. I even tossed that into my client to see if
it would make a difference. Not surprisingly, it did not.
It was a web site authored in WML (Wireless Markup Language) and
accessed via WAP protocol by mobile phones.
(Specification reference for this WML/WAP usage:
http://technical.openmobilealliance.org/Technical/release_program/docs/Browsing/V2_3-20070227-C/WAP-191-WML-20000219-a.pdf
Document title: WAP WML WAP-191-WML 19 February 2000
Wireless Application Protocol Wireless Markup Language
Specification Version 1.3
-> Page 30 of 110 (in Section "9.5.1 The Go Element"): There is a
table, where the following line is relevant:
Method: post
Enctype: application/x-www-form-urlencoded
Process: [...] The Content-Type header must include the charset
parameter to indicate the character encoding.
I suspect that the above URL is not the official location of the
document. I found it through Googling. Official location should be
http://www.wapforum.org/what/technical.htm )
Apache Tomcat supports the use of charset parameter with
Content-Type application/x-www-form-urlencoded in POST requests.
Good for Tomcat. That /is/ the intelligent thing to do, MIME-type
notwithstanding.
Because if clients such as standard web browsers ever came to pay more attention and
send this attribute, much of the current confusion would go away.
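Just to illustrate, a minimal client-side sketch (the endpoint URL and parameter name
are of course made up) which both percent-encodes the value as UTF-8 and declares that
fact in the Content-Type header:

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class CharsetAwarePost {
        public static void main(String[] args) throws Exception {
            // Percent-encode the value as UTF-8 bytes: "André" -> "Andr%C3%A9"
            String body = "name=" + URLEncoder.encode("André", StandardCharsets.UTF_8);

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/test")) // made-up endpoint
                    // Declare the charset explicitly; Tomcat honours it.
                    .header("Content-Type",
                            "application/x-www-form-urlencoded; charset=UTF-8")
                    .POST(HttpRequest.BodyPublishers.ofString(body,
                            StandardCharsets.UTF_8))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }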
Even better would be if the RFC for "application/x-www-form-urlencoded" were amended to
specify that this charset attribute SHOULD be provided, and that by default its value
would be "ISO-8859-1" (for now; but there is a good case to make it UTF-8 nowadays).
And the justification for this would be that, in practice, this MIME type undoubtedly
applies exclusively to *text* data anyway, and that at numerous other places the HTTP
and WWW-related specifications already say that for text data, the character set/encoding
should be clearly specified.
I mean, quite obviously, the current definition, saying that this MIME type, which is used
in millions of places to pass named text values from HTML <form>s to webservers, is to be
composed exclusively of character codes belonging to the US-ASCII alphabet, is hopelessly
out-of-date and is, in the real world, violated millions of times every day.
Or would someone claim that there are not hundreds of thousands of web forms being
submitted every day to webservers in Germany, France, Spain, etc. using POSTs with a
Content-Type of "application/x-www-form-urlencoded", and that no parameter passed in this
way ever contains more than US-ASCII characters?
In fact, if Tomcat were to strictly respect the MIME type definition of
"application/x-www-form-urlencoded" and thus, after percent-decoding the POST body,
interpret every byte of the resulting string strictly as a character in the US-ASCII
character set, that /would/ instantly break thousands of applications.
Interesting. I suspect that's because there are practical situations
where "being liberal with what you accept" is more appropriate than
angrily demanding that all clients be 100% spec-compliant :)
The (illegal) charset parameter can only mean one thing: the character
encoding to use to assemble url-decoded bytes into an actual string
value (e.g. %C2%AE -> 0xc2 0xae -> "®" when using UTF-8).
Thanks for that final reference; it really does close the case on this
whole thing.
It does not, really. That would just brush it under the carpet, again.
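The ambiguity Christopher mentions above is easy to demonstrate: the very same
percent-encoded bytes come out as different strings depending on which charset the
server guesses. A little sketch (Shift_JIS ships with most, though not all, JREs):

    import java.net.URLDecoder;

    public class DecodeAmbiguity {
        public static void main(String[] args) throws Exception {
            String raw = "%C2%AE";
            // The same two bytes, 0xC2 0xAE, three different readings:
            System.out.println(URLDecoder.decode(raw, "UTF-8"));      // "®"
            System.out.println(URLDecoder.decode(raw, "ISO-8859-1")); // "Â®"
            // In Shift_JIS these two bytes are two halfwidth katakana
            // characters, i.e. something else entirely.
            System.out.println(URLDecoder.decode(raw, "Shift_JIS"));
        }
    }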
Addendum:
It seems that HTML 5 is (finally) trying to do something about this muddle:
- Starting from the MIME type registry entry for "application/x-www-form-urlencoded",
at
http://www.iana.org/assignments/media-types/application/x-www-form-urlencoded
- which says:
"
Interoperability considerations :
Rules for generating and processing application/x-www-form-urlencoded payloads are defined
in the HTML specification.
Published specification :
http://www.w3.org/TR/html is the relevant specification. Algorithms for encoding and
decoding are defined.
"
- and thus going to http://www.w3.org/TR/html ...
- which somehow leads to:
https://www.w3.org/TR/html/sec-forms.html#application-x-www-form-urlencoded-encoding-algorithm
- and from there to:
https://url.spec.whatwg.org/#concept-urlencoded-serializer
it would now seem (unless I misinterpret, which is a distinct possibility) that the
content of an "application/x-www-form-urlencoded" POST, *after* URL-percent-decoding,
*may* be a UTF-8 encoded Unicode string (it may also be something else).
(There is even a provision for including a hidden "_charset_" parameter naming the
charset/encoding. Yet another muddle?)
(This also applies only to HTML 5 <form> documents, but let's skip this for a
moment).
Still, as far as I can tell, there is to some extent still the same "chicken-and-egg"
problem, in the sense that in order to parse the above parameter, one would first have to
decode the "application/x-www-form-urlencoded" POST body, using some character set.
For which one would need to know that very character set before decoding.
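(In fairness, the egg can be cracked, but only because the separators '&', '=' and '%'
are plain ASCII, so the *structure* of the body is readable under any ASCII-compatible
encoding. Here is a sketch of the two-pass decoding that this forces upon a server; the
class name is mine, and the whole trick breaks down for charsets that are not
ASCII-compatible:)

    import java.net.URLDecoder;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TwoPassFormParser {

        public static Map<String, String> parse(String rawBody) throws Exception {
            // Pass 1: split on the ASCII separators and look for the hidden
            // "_charset_" parameter. Charset names are themselves ASCII-only,
            // so US-ASCII suffices to decode that one value.
            String charset = "ISO-8859-1"; // fallback when nothing is declared
            for (String pair : rawBody.split("&")) {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2 && kv[0].equals("_charset_")) {
                    charset = URLDecoder.decode(kv[1], "US-ASCII");
                }
            }
            // Pass 2: decode every name and value with the charset found above.
            Map<String, String> params = new LinkedHashMap<>();
            for (String pair : rawBody.split("&")) {
                String[] kv = pair.split("=", 2);
                String name = URLDecoder.decode(kv[0], charset);
                String value = (kv.length == 2) ? URLDecoder.decode(kv[1], charset) : "";
                params.put(name, value);
            }
            return params;
        }

        public static void main(String[] args) throws Exception {
            // Prints {_charset_=UTF-8, name=André}
            System.out.println(parse("_charset_=UTF-8&name=Andr%C3%A9"));
        }
    }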
To summarise:
In a POST in the "application/x-www-form-urlencoded" format, there is a body. This body
has a single part, and it cannot be anything other than text (it is in fact a
"query-string" composed of name/value pairs; only, it is put in the body of the request
instead of being appended to the URL).
So the Content-Type header of the POST request would be the perfect logical place to add a
"charset" parameter, which would lift any uncertainty about the content of this
query-string, character-set wise. And by default, for now, it could be ISO-8859-1, to
match the majority of the rest of the WWW-related specs. (But it would *allow* the use of
any other encoding.)
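On the server side, handling such a rule would take only a few lines; a minimal sketch
(the servlet and parameter names are mine):

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class FormServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // getCharacterEncoding() returns the charset parameter of the
            // request's Content-Type header, or null when there is none.
            if (request.getCharacterEncoding() == null) {
                // Apply the proposed default before reading any parameter.
                request.setCharacterEncoding("ISO-8859-1");
            }
            String name = request.getParameter("name"); // decoded predictably now
            response.setContentType("text/plain; charset=UTF-8");
            response.getWriter().println("Hello, " + name);
        }
    }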
I do not believe that this would break anything. For clients which do not provide this
charset attribute, the current muddled logic would still apply.
And it would certainly be simpler to implement than the logic described in the HTML 5
document.
Pretty much the same solution applies to POSTs in the "multipart/form-data" format, where
each posted parameter already has its own section with a MIME header. Whenever one of
these parameters is text, that section should specify a charset. (And if it doesn't, the
current muddle applies.)
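For instance, a hand-built body along these lines (boundary and parameter name
arbitrary), where the text part declares its own charset:

    public class MultipartSketch {
        public static void main(String[] args) {
            // A multipart/form-data body in which the one text part
            // carries its own Content-Type with a charset parameter.
            String boundary = "----boundary42";
            String body =
                "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"name\"\r\n"
                + "Content-Type: text/plain; charset=UTF-8\r\n"
                + "\r\n"
                + "André\r\n"
                + "--" + boundary + "--\r\n";
            System.out.print(body);
        }
    }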
The only remaining muddle is with the parameters passed inside the URL, as a
query-string.
But for those, one could apply, for example, the same mechanism as is already applied to
non-ASCII email header values (see https://tools.ietf.org/html/rfc2047). This is not
really ideal in terms of simplicity, but 1) the code exists and works, and 2) it would
certainly be preferable to the current muddled situation and recurrent parameter-encoding
problems. (And again, for clients which do not use this, the current muddle applies.)
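That mechanism is small enough to sketch in a few lines (B-encoding only; RFC 2047 also
defines a Q-encoding, and a real implementation would have to respect its 75-character
limit on encoded-words):

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class EncodedWord {

        // Builds an RFC 2047 "encoded-word": =?charset?B?base64-of-the-bytes?=
        public static String encode(String text) {
            String b64 = Base64.getEncoder()
                    .encodeToString(text.getBytes(StandardCharsets.UTF_8));
            return "=?UTF-8?B?" + b64 + "?=";
        }

        public static void main(String[] args) {
            // "André" -> "=?UTF-8?B?QW5kcsOp?="
            System.out.println(encode("André"));
        }
    }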
Altogether, to me it looks like there are two bodies of experts, one on the
HTML-and-client side and one on the HTTP-and-webserver side (or maybe these are four
bodies), who have not really been talking to each other constructively on this issue for
years.
The result is that, instead of agreeing on some simple rules, each one of them has kind of
patched together its own separate set of rules (and a lot of complex software), to finally
obtain something which still does not really solve the interoperability problem
fundamentally.
The current situation is nothing short of ridiculous:
- there are many character sets/encodings in use, but most/all of them are clearly defined
and named
- there are millions of webservers, and billions of web clients
But fundamentally:
- currently, a client has no way to know for sure what character set/encoding it should
use when it first tries to send some piece of text data to a webserver
- currently, a webserver has no way to know for sure in what character set/encoding a
client is sending text data to it
I'm sure that we can do better. But someone somewhere has to take the initiative. And
who better than an open-source software foundation whose products already dominate the
worldwide webserver market?