On 18.11.2016 20:27, Christopher Schultz wrote:
Konstantin,
On 11/18/16 2:10 PM, Konstantin Kolinko wrote:
One more authority that I forgot to mention in my mail: the IANA
registry of MIME types
Registry:
https://www.iana.org/assignments/media-types/media-types.xhtml
Registration entry for "application/x-www-form-urlencoded"
https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded
-> Encoding considerations : 7bit
According to the RFC defining this registry, that means the data is
7-bit ASCII only. https://tools.ietf.org/html/rfc6838#section-4.8
Oh, that's the nail in the coffin.
application/x-www-form-urlencoded from W3C says "if the character
doesn't fit into the encoding of the message, it must be %-encoded"
but it never says what "the encoding of the message" actually is. My
worry was that it was mutable, and that UTF-8 was a valid encoding,
meaning that 0xc2 0xae on the wire would have been acceptable (rather
than %C2%AE).
If application/x-www-form-urlencoded is *absolutely* supposed to be
7-bit ASCII, then nothing above 0x7f can ever be legally transferred
across the wire when using that content-type.
This solves André's problem with this content-type where he wanted to
specify the charset to be used. It seems the standard defines the
character set: US-ASCII.
With respect, this is not only "André's problem".
This is a general problem (not only with Tomcat), which affects users, web application
programmers, and webserver developers alike, as soon as they are dealing with the World at
large, which uses a lot of languages that cannot be represented in the iso-latin-1
character set, and much less in the US-ASCII character set.
It affects users, because many users still regularly see the data that they enter into web
application pages and submit to a server being misinterpreted. (I cannot tell you how
many times, even nowadays, I fill in my name in a web form, only to have it echoed back to
me as some mangled variation of "André"..)
As for web application and webserver developers, one only has to look at the archives of a
forum such as Tomcat's to see how often and how regularly such issues come up, and keep
coming up, over the years:
Sample from marc.info, tomcat-user:
period: 2016-02-08 / 2016-11-19
Total messages: 3582
Messages mentioning "encoding": 164
Messages mentioning "character set": 41
for comparison:
Messages mentioning "NIO": 90
Messages mentioning "AJP": 201
Messages mentioning "memory": 258
Granted, this is not a very fine-grained analysis. But all in all, it would tend to
suggest that this is not a "minor" issue: for Tomcat alone, it comes up just about as
often as the "memory usage" topic, and more often than either Connector above.
I would also posit that, this being an English-language forum, the posters here would tend
to be predominantly English-speaking developers, who are quite likely not the ones most
affected by such issues. So the above numbers quite likely understate the number of people
really affected by such matters.
And one could also look at the amount of code, in applications and in Tomcat itself, which
is dedicated to working around related issues.
(Think "useBodyEncodingForURI",
"org.apache.catalina.filters.AddDefaultCharsetFilter", and the many variations of a
set-character-encoding filter, such as the one sketched below.)
Basically what I'm saying is that this "posted-parameters-encoding-issue" is far from
being "licked", despite the fact that native English-speaking developers may have a
tendency to believe that it is.
The only problem now is that it's not clear how to turn %C2%AE into a
character, because you have to know that UTF-8, and not Shift_JIS or
whatever, is being used.
-> Required parameters : No parameters
-> Optional parameters : No parameters
OK. So no charset= parameter is allowed. My advice to specify the
charset parameter was wrong.
No, it wasn't, not really. I believe that you were on a good track there.
It is the spec that is wrong, really.
One is allowed to question a spec if it appears wrong, no?
After all, RFC means "Request For Comments".
Agreed: it is always against the spec(s) to specify a charset for any
MIME type that is not text/*.
Agreed. It just makes no sense for data that is not fundamentally "text".
(Whether such text data has a MIME type whose name starts with "text/" is quite another
matter. For example, the MIME type "application/ecmascript" refers to text data
(javascript code), and allows a charset attribute, even though its type name does not
start with "text/"; there are many other types like that.)
Though historically ~10 years ago I saw
"application/x-www-form-urlencoded;charset=UTF-8" Content-Type in
the wild.
Oh, I'm sure you saw it. I even tossed that into my client to see if
it would make a difference. Not surprisingly, it did not.
It was a web site authored in WML (Wireless Markup Language) and
accessed via WAP protocol by mobile phones.
(Specification reference for this WML/WAP usage:
http://technical.openmobilealliance.org/Technical/release_program/docs/Browsing/V2_3-20070227-C/WAP-191-WML-20000219-a.pdf
Document title: WAP WML WAP-191-WML 19 February 2000
Wireless Application Protocol Wireless Markup Language
Specification Version 1.3
-> Page 30 of 110 (in Section "9.5.1 The Go Element"): There is a
table, where the following line is relevant:
Method: post
Enctype: application/x-www-form-urlencoded
Process: [...] The Content-Type header must include the charset
parameter to indicate the character encoding.
I suspect that the above URL is not the official location of the
document. I found it through Googling. Official location should be
http://www.wapforum.org/what/technical.htm )
Apache Tomcat supports the use of charset parameter with
Content-Type application/x-www-form-urlencoded in POST requests.
Good for Tomcat. That /is/ the intelligent thing to do, MIME-type
notwithstanding.
Because if clients such as standard web browsers ever came to pay more attention and
send this attribute, much of the current confusion would go away.
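Just to illustrate, a minimal client-side sketch (the endpoint URL and parameter name
are of course made up) which both percent-encodes the value as UTF-8 and declares that
fact in the Content-Type header:

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class CharsetAwarePost {
        public static void main(String[] args) throws Exception {
            // Percent-encode the value as UTF-8 bytes: "André" -> "Andr%C3%A9"
            String body = "name=" + URLEncoder.encode("André", StandardCharsets.UTF_8);

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/test")) // made-up endpoint
                    // Declare the charset explicitly; Tomcat honours it.
                    .header("Content-Type",
                            "application/x-www-form-urlencoded; charset=UTF-8")
                    .POST(HttpRequest.BodyPublishers.ofString(body,
                            StandardCharsets.UTF_8))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }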
Even better would be if the RFC for "application/x-www-form-urlencoded" were amended to
specify that this charset attribute SHOULD be provided, and that by default its value
would be "ISO-8859-1" (for now; but there is a good case to make it UTF-8 nowadays).
And the justification for this would be that, in practice, this MIME type undoubtedly
applies exclusively to *text* data anyway, and that at numerous other places the HTTP
and WWW-related specifications already say that for text data, the character set/encoding
should be clearly specified.
I mean, quite obviously, the current definition, saying that this MIME type, which is used
in millions of places to pass named text values from HTML <form>s to webservers, is to be
composed exclusively of character codes belonging to the US-ASCII alphabet, is hopelessly
out-of-date and is, in the real world, violated millions of times every day.
Or would someone claim that there are not hundreds of thousands of web forms being
submitted every day to webservers in Germany, France, Spain, etc. using POSTs with a
Content-Type of "application/x-www-form-urlencoded", and that no parameter passed in this
way ever contains more than US-ASCII characters?
In fact, if Tomcat were to strictly respect the MIME type definition of
"application/x-www-form-urlencoded" and thus, after percent-decoding the POST body,
interpret every byte of the resulting string strictly as a character in the US-ASCII
character set, that /would/ instantly break thousands of applications.
Interesting. I suspect that's because there are practical situations
where "being liberal with what you accept" is more appropriate than
angrily demanding that all clients be 100% spec-compliant :)
The (illegal) charset parameter can only mean one thing: the character
encoding to use to assemble url-decoded bytes into an actual string
value (e.g. %C2%AE -> 0xc2 0xae -> "®" when using UTF-8).
Thanks for that final reference; it really does close the case on this
whole thing.
It does not, really. That would just brush it under the carpet, again.
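The ambiguity Christopher mentions above is easy to demonstrate: the very same
percent-encoded bytes come out as different strings depending on which charset the
server guesses. A little sketch (Shift_JIS ships with most, though not all, JREs):

    import java.net.URLDecoder;

    public class DecodeAmbiguity {
        public static void main(String[] args) throws Exception {
            String raw = "%C2%AE";
            // The same two bytes, 0xC2 0xAE, three different readings:
            System.out.println(URLDecoder.decode(raw, "UTF-8"));      // "®"
            System.out.println(URLDecoder.decode(raw, "ISO-8859-1")); // "Â®"
            // In Shift_JIS these two bytes are two halfwidth katakana
            // characters, i.e. something else entirely.
            System.out.println(URLDecoder.decode(raw, "Shift_JIS"));
        }
    }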
Addendum:
It seems that HTML 5 is (finally) trying to do something about this muddle:
- Starting from the MIME type registry entry for "application/x-www-form-urlencoded",
at
http://www.iana.org/assignments/media-types/application/x-www-form-urlencoded
- which says:
"
Interoperability considerations :
Rules for generating and processing application/x-www-form-urlencoded payloads are defined
in the HTML specification.
Published specification :
http://www.w3.org/TR/html is the relevant specification. Algorithms for encoding and
decoding are defined.
"
- and thus going to http://www.w3.org/TR/html ...
- which somehow leads to:
https://www.w3.org/TR/html/sec-forms.html#application-x-www-form-urlencoded-encoding-algorithm
- and from there to:
https://url.spec.whatwg.org/#concept-urlencoded-serializer
it would now seem (unless I misinterpret, which is a distinct possibility) that the
content of an "application/x-www-form-urlencoded" POST, *after* URL-percent-decoding,
*may* be a UTF-8 encoded Unicode string (it may also be something else).
(There is even a provision for including a hidden "_charset_" parameter naming the
charset/encoding. Yet another muddle?)
(This also applies only to HTML 5 <form> documents, but let's skip this for a
moment).
Still, as far as I can tell, there is to some extent still the same "chicken-and-egg"
problem, in the sense that in order to parse the above parameter, one would first have to
decode the "application/x-www-form-urlencoded" POST body, using some character set.
For which one would need to know that very character set before decoding.
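(In fairness, the egg can be cracked, but only because the separators '&', '=' and '%'
are plain ASCII, so the *structure* of the body is readable under any ASCII-compatible
encoding. Here is a sketch of the two-pass decoding that this forces upon a server; the
class name is mine, and the whole trick breaks down for charsets that are not
ASCII-compatible:)

    import java.net.URLDecoder;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class TwoPassFormParser {

        public static Map<String, String> parse(String rawBody) throws Exception {
            // Pass 1: split on the ASCII separators and look for the hidden
            // "_charset_" parameter. Charset names are themselves ASCII-only,
            // so US-ASCII suffices to decode that one value.
            String charset = "ISO-8859-1"; // fallback when nothing is declared
            for (String pair : rawBody.split("&")) {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2 && kv[0].equals("_charset_")) {
                    charset = URLDecoder.decode(kv[1], "US-ASCII");
                }
            }
            // Pass 2: decode every name and value with the charset found above.
            Map<String, String> params = new LinkedHashMap<>();
            for (String pair : rawBody.split("&")) {
                String[] kv = pair.split("=", 2);
                String name = URLDecoder.decode(kv[0], charset);
                String value = (kv.length == 2) ? URLDecoder.decode(kv[1], charset) : "";
                params.put(name, value);
            }
            return params;
        }

        public static void main(String[] args) throws Exception {
            // Prints {_charset_=UTF-8, name=André}
            System.out.println(parse("_charset_=UTF-8&name=Andr%C3%A9"));
        }
    }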
To summarise:
In a POST in the "application/x-www-form-urlencoded" format, there is a body. This body
has a single part, and it cannot be anything other than text (it is in fact a
"query-string" composed of name/value pairs; only, it is put in the body of the request
instead of being appended to the URL).
So the Content-Type header of the POST request would be the perfect logical place to add a
"charset" parameter, which would lift any uncertainty about the content of this
query-string, character-set wise. And by default, for now, it could be ISO-8859-1, to
match the majority of the rest of the WWW-related specs. (But it would *allow* the use of
any other encoding.)
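On the server side, handling such a rule would take only a few lines; a minimal sketch
(the servlet and parameter names are mine):

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class FormServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // getCharacterEncoding() returns the charset parameter of the
            // request's Content-Type header, or null when there is none.
            if (request.getCharacterEncoding() == null) {
                // Apply the proposed default before reading any parameter.
                request.setCharacterEncoding("ISO-8859-1");
            }
            String name = request.getParameter("name"); // decoded predictably now
            response.setContentType("text/plain; charset=UTF-8");
            response.getWriter().println("Hello, " + name);
        }
    }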
I do not believe that this would break anything. For clients which do not provide this
charset attribute, the current muddled logic would still apply.
And it would certainly be simpler to implement than the logic described in the HTML 5
document.
Pretty much the same solution applies to POSTs in the "multipart/form-data" format, where
each posted parameter already has its own section with a MIME header. Whenever one of
these parameters is text, that section should specify a charset. (And if it doesn't, the
current muddle applies.)
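For instance, a hand-built body along these lines (boundary and parameter name
arbitrary), where the text part declares its own charset:

    public class MultipartSketch {
        public static void main(String[] args) {
            // A multipart/form-data body in which the one text part
            // carries its own Content-Type with a charset parameter.
            String boundary = "----boundary42";
            String body =
                "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"name\"\r\n"
                + "Content-Type: text/plain; charset=UTF-8\r\n"
                + "\r\n"
                + "André\r\n"
                + "--" + boundary + "--\r\n";
            System.out.print(body);
        }
    }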
The only remaining muddle is with the parameters passed inside the URL, as a
query-string.
But for those, one could apply, for example, the same mechanism as is already applied to
non-ASCII email header values (see https://tools.ietf.org/html/rfc2047). This is not
really ideal in terms of simplicity, but 1) the code exists and works, and 2) it would
certainly be preferable to the current muddled situation and recurrent parameter-encoding
problems. (And again, for clients which do not use this, the current muddle applies.)
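That mechanism is small enough to sketch in a few lines (B-encoding only; RFC 2047 also
defines a Q-encoding, and a real implementation would have to respect its 75-character
limit on encoded-words):

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class EncodedWord {

        // Builds an RFC 2047 "encoded-word": =?charset?B?base64-of-the-bytes?=
        public static String encode(String text) {
            String b64 = Base64.getEncoder()
                    .encodeToString(text.getBytes(StandardCharsets.UTF_8));
            return "=?UTF-8?B?" + b64 + "?=";
        }

        public static void main(String[] args) {
            // "André" -> "=?UTF-8?B?QW5kcsOp?="
            System.out.println(encode("André"));
        }
    }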
Altogether, to me it looks like there are two bodies of experts, one on the
HTML-and-client side and one on the HTTP-and-webserver side (or maybe these are four
bodies), who have not really been talking to each other constructively on this issue for
years.
The result is that, instead of agreeing on some simple rules, each one of them has kind of
patched together its own separate set of rules (and a lot of complex software), to finally
obtain something which still does not really solve the interoperability problem
fundamentally.
The current situation is nothing short of ridiculous:
- there are many character sets/encodings in use, but most/all of them are clearly defined
and named
- there are millions of webservers, and billions of web clients
But fundamentally:
- currently, a client has no way to know for sure what character set/encoding it should
use when it first tries to send some piece of text data to a webserver
- currently, a webserver has no way to know for sure in what character set/encoding a
client is sending text data to it
I'm sure that we can do better. But someone somewhere has to take the initiative. And
who better than an open-source software foundation whose products already dominate the
worldwide webserver market?