Re: [OT] distinction between resource charset and format octet decoding
On 2/6/2020 12:43 PM, Christopher Schultz wrote:

> …
>> * Therefore `web.xml` settings, HTTP headers, etc. are all irrelevant, as this is an issue dealing with the file format itself, and the latest spec for the file format says to use UTF-8, so everyone should use UTF-8 already.
>
> Except for everyone who already uses something else and expects everything to be backward-compatible.

I think there comes a time when we have to move forward after some critical level of usage is reached. I think we've passed that point. Modern browsers in the sense that you mention are not backwards-compatible for `application/x-www-form-urlencoded`. So what are we being compatible with by not using UTF-8 decoding? Do we have anything besides browsers consuming output from legacy JSP apps? As noted, the browsers break when we try to be "backwards-compatible" in the sense you mention.

> The problem is that you don't get to declare what's "best" for everyone and then the whole world does what you want.

But here I would imagine that everyone already agrees on what's best; the debate is whether we should do something different from what we know is best because of some outdated specs. (And I say that as a huge proponent of following standards.)

I'll give you an example that is directly relevant. Over 10 years ago I strongly advocated to the RDF group that the Internet should abandon the outdated practice of requiring that `text/*` media types default to US-ASCII; otherwise there would be no point in using `text/*` for anything going forward! (That's why we went through a sad phase where everyone was using `application/*` for text formats because they wanted to default to something other than US-ASCII.)

* https://www.w3.org/2008/01/rdf-media-types
* https://lists.w3.org/Archives/Public/www-archive/2007Dec/0059.html

Sure enough, eventually someone saw the light (I won't claim I had anything to do with it, but it is exactly what I was arguing for) and created https://tools.ietf.org/html/rfc6657, which says that individual `text/*` types can choose a default other than ASCII. Finally we're not stuck in the past anymore!

I would say that someone needs to create an updated `application/x-www-form-urlencoded` specification prescribing UTF-8 decoding of encoded octets, except that the WhatWG has already done that! So I'm not declaring that everyone should do it "my" way. I'm saying everyone should follow the latest spec, which already exists.

Anyway, thanks for listening. I think it's a fun discussion, and I wasn't being combative; I just wanted to tell a bit of the story. I need to get back to work now. :)

Thanks again for the change in Tomcat 10!

Garret
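To make the point about UTF-8 decoding of encoded octets concrete, here is a minimal Java sketch of parsing an `application/x-www-form-urlencoded` body with the `%nn` octets interpreted as UTF-8. The class and method names (`FormDecodingSketch`, `parseForm`) are made up for illustration, and this is not how Tomcat or any browser implements it. It assumes Java 10 or later for `URLDecoder.decode(String, Charset)`, and it is stricter than the WhatWG algorithm in that `URLDecoder` throws on a malformed `%` escape.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class FormDecodingSketch {

    /** Splits a form body on '&' and '=', decoding %nn octets (and '+') as UTF-8. */
    static Map<String, String> parseForm(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty segments such as a trailing '&'
            }
            int eq = pair.indexOf('=');
            String rawName = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            // The charset here is a property of the format, not of Content-Type.
            params.put(URLDecoder.decode(rawName, StandardCharsets.UTF_8),
                       URLDecoder.decode(rawValue, StandardCharsets.UTF_8));
        }
        return params; // for repeated names, the last value wins in this sketch
    }

    public static void main(String[] args) {
        System.out.println(parseForm("name=Andr%C3%A9&city=S%C3%A3o+Paulo"));
    }
}
```

Note that the body itself contains only ASCII characters; the UTF-8 question arises only once the `%C3%A9` escapes are turned back into octets and those octets into characters.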
Re: [OT] distinction between resource charset and format octet decoding
Garret,

On 2/6/20 10:25 AM, Garret Wilson wrote:
> On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:
>> …
>>> As of Tomcat 10, conf/web.xml contains the following:
>>>
>>> <request-character-encoding>UTF-8</request-character-encoding>
>>> <response-character-encoding>UTF-8</response-character-encoding>
>>>
>>> That *should* have the effect you are looking for but I confess I haven't tested it in any great detail.
>>
>> As I am sure many people (Christopher included) would agree, the real solution would be for browsers and other HTTP clients to indicate clearly in the request, the charset/encoding of each text parameter that they are sending. There are even HTTP headers already defined for that.
>
> Which HTTP headers are you referring to? `Content-Type`? It is my opinion that this is irrelevant and not applicable.
>
> As I explained (extensively) in my original post for this thread back on 2019-01-08, the issue is not the charset of `application/x-www-form-urlencoded`. That media type is made up of ASCII characters. It doesn't matter whether you say it's ASCII, ISO-8859-1, UTF-8, or whatever, the actual characters stay 100% the same.

Hmm. Not always. While it may be true that:

1. ASCII, ISO-8859-1, and UTF-8 are very common
2. ASCII, ISO-8859-1, and UTF-8 encode the first 128 code points identically

It is not true that:

3. All character encodings encode the first 128 code points identically.

UTF-16 doesn't follow that pattern.

> At issue is when certain octets are encoded (as specified by the `application/x-www-form-urlencoded` media type itself), what charset to use when decoding them. This is independent of the encoding of the media type itself; rather this is defined by the specification for the format.

Correct. And there is a lack of agreement for URLs, so browsers decided to make it up. It's not possible to guess what the browser has chosen because it does not advertise it in any way (absent a standard).

The only 100% reliable way to do it would be to add a parameter to every request which has a known-correct value that can be unambiguously decoded. You just keep re-decoding the whole URL until that parameter value matches the known-correct value. Sounds like a lot of fun to implement across a whole application, right?

> Unfortunately https://tools.ietf.org/html/rfc1866 actually says we should use ASCII when decoding the octets, but this is severely antiquated and doesn't fit with modern practice. The WhatWG essentially redefines the format to say that the octets must be interpreted as UTF-8:
>
> https://url.spec.whatwg.org/#application/x-www-form-urlencoded
>
> So to summarize my view:
>
> * The decoding of the `application/x-www-form-urlencoded` media type encoded octets is completely independent of the charset indicated in the `Content-Type` header, and rather goes to the specification of the format itself.

It's strange, because Content-Type can contain a charset parameter, but MIME specifically says that "charset" parameters are only appropriate for "text/*" MIME types. So for application/x-www-form-urlencoded, you "shouldn't" add that parameter. But there's no particular reason NOT to include it (it doesn't actually violate any spec) and adding it COMPLETELY AND UNAMBIGUOUSLY indicates what the browser chose as the encoding.

> * RFC 1866 is severely out of date and out of step, and the WhatWG's specification of the `application/x-www-form-urlencoded` media type should be used instead. (Modern browser practice would seem to agree with me.)

RFC 1866 has been very much superseded. Also, HTML specs shouldn't be defining HTTP semantics. So ignore whatever is in RFC 1866 on multiple grounds.

> * Therefore `web.xml` settings, HTTP headers, etc. are all irrelevant, as this is an issue dealing with the file format itself, and the latest spec for the file format says to use UTF-8, so everyone should use UTF-8 already.

Except for everyone who already uses something else and expects everything to be backward-compatible.

The problem is that you don't get to declare what's "best" for everyone and then the whole world does what you want.

I happen to agree with you (Everyone should move to UTF-8 for everything. Everywhere. Forever.), but you have to recognize that there are history and entrenched systems, environments, and mindsets.

> The new default `web.xml` in Tomcat 10 is a wonderful step in the right direction.

+1

-chris
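The "known-correct parameter" idea described above can be sketched in Java roughly as follows. Everything here is an assumption made for illustration: the sentinel name `_charset_check`, its expected value, the candidate list, and the class name are all hypothetical, and nothing like this ships in Tomcat. The sketch re-decodes the raw query under each candidate charset until the sentinel comes out right (Java 10+ for `URLDecoder.decode(String, Charset)`).

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of sentinel-based charset detection.
final class SentinelCharsetSketch {

    // Assumed convention: every form submits _charset_check with this exact value.
    static final String SENTINEL_NAME = "_charset_check";
    static final String EXPECTED = "\u00e9"; // e with acute accent; not representable in ASCII

    // Candidates are tried in order; UTF-8 first because it is the modern default.
    static final List<Charset> CANDIDATES =
            List.of(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

    /** Returns the first candidate under which the sentinel decodes to EXPECTED. */
    static Optional<Charset> detect(String rawQuery) {
        for (Charset candidate : CANDIDATES) {
            for (String pair : rawQuery.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue; // a bare name cannot be the sentinel
                }
                String name = URLDecoder.decode(pair.substring(0, eq), candidate);
                String value = URLDecoder.decode(pair.substring(eq + 1), candidate);
                if (SENTINEL_NAME.equals(name) && EXPECTED.equals(value)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty(); // sentinel missing, or sent in an untried charset
    }

    public static void main(String[] args) {
        System.out.println(detect("q=caf%C3%A9&_charset_check=%C3%A9"));
        System.out.println(detect("q=caf%E9&_charset_check=%E9"));
    }
}
```

Once a charset is detected, the remaining parameters would be decoded with it. As the message says, wiring this through a whole application is the unpleasant part.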
Re: [OT] distinction between resource charset and format octet decoding
On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:

> …
>> As of Tomcat 10, conf/web.xml contains the following:
>>
>> <request-character-encoding>UTF-8</request-character-encoding>
>> <response-character-encoding>UTF-8</response-character-encoding>
>>
>> That *should* have the effect you are looking for but I confess I haven't tested it in any great detail.
>
> As I am sure many people (Christopher included) would agree, the real solution would be for browsers and other HTTP clients to indicate clearly in the request, the charset/encoding of each text parameter that they are sending. There are even HTTP headers already defined for that.

Which HTTP headers are you referring to? `Content-Type`? It is my opinion that this is irrelevant and not applicable.

As I explained (extensively) in my original post for this thread back on 2019-01-08, the issue is not the charset of `application/x-www-form-urlencoded`. That media type is made up of ASCII characters. It doesn't matter whether you say it's ASCII, ISO-8859-1, UTF-8, or whatever, the actual characters stay 100% the same.

At issue is when certain octets are encoded (as specified by the `application/x-www-form-urlencoded` media type itself), what charset to use when decoding them. This is independent of the encoding of the media type itself; rather this is defined by the specification for the format.

Unfortunately https://tools.ietf.org/html/rfc1866 actually says we should use ASCII when decoding the octets, but this is severely antiquated and doesn't fit with modern practice. The WhatWG essentially redefines the format to say that the octets must be interpreted as UTF-8:

https://url.spec.whatwg.org/#application/x-www-form-urlencoded

So to summarize my view:

* The decoding of the `application/x-www-form-urlencoded` media type encoded octets is completely independent of the charset indicated in the `Content-Type` header, and rather goes to the specification of the format itself.
* RFC 1866 is severely out of date and out of step, and the WhatWG's specification of the `application/x-www-form-urlencoded` media type should be used instead. (Modern browser practice would seem to agree with me.)
* Therefore `web.xml` settings, HTTP headers, etc. are all irrelevant, as this is an issue dealing with the file format itself, and the latest spec for the file format says to use UTF-8, so everyone should use UTF-8 already.

The new default `web.xml` in Tomcat 10 is a wonderful step in the right direction. See my original post for more in-depth explanation.

Garret
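A small Java sketch of the claim that the already-encoded form body is identical under ASCII, ISO-8859-1, and UTF-8: since the body contains only ASCII characters, all three charsets serialize it to the same octets. The class name and sample body are hypothetical. UTF-16 is included as the counterexample raised elsewhere in the thread, since it uses two octets per character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demonstration class.
final class AsciiCompatibilitySketch {

    // An already percent-encoded form body: every character is plain ASCII.
    static final String ENCODED_BODY = "name=Andr%C3%A9";

    /** True if the encoded body serializes to identical octets in both charsets. */
    static boolean sameOctets(Charset a, Charset b) {
        return Arrays.equals(ENCODED_BODY.getBytes(a), ENCODED_BODY.getBytes(b));
    }

    public static void main(String[] args) {
        System.out.println(sameOctets(StandardCharsets.US_ASCII, StandardCharsets.UTF_8));
        System.out.println(sameOctets(StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
        // UTF-16BE spends two octets on each ASCII character, so this differs.
        System.out.println(sameOctets(StandardCharsets.UTF_16BE, StandardCharsets.UTF_8));
    }
}
```

The resource charset therefore says nothing about which charset the octets `C3 A9` behind the escapes were meant to represent; that is the separate question the format specification has to answer.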
Re: [OT] distinction between resource charset and format octet decoding
On 06.02.2020 14:44, Mark Thomas wrote:
> On 06/02/2020 13:39, Garret Wilson wrote:
>> On 2/6/2020 10:36 AM, Mark Thomas wrote:
>>> …
>>>>> Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.
>>>> Is this still on the list for discussion for Tomcat 10?
>>> No, because it has already been implemented for Tomcat 10 and is in the milestone release currently being voted on.
>>
>> Waitasec. I'm not used to good news, so I want to make sure I understand what you're saying. Are you saying that the proposed Tomcat 10 implementation already interprets encoded octets in web form submissions using UTF-8 by default?!! :O
>
> As of Tomcat 10, conf/web.xml contains the following:
>
> <request-character-encoding>UTF-8</request-character-encoding>
> <response-character-encoding>UTF-8</response-character-encoding>
>
> That *should* have the effect you are looking for but I confess I haven't tested it in any great detail.

As I am sure many people (Christopher included) would agree, the real solution would be for browsers and other HTTP clients to indicate clearly in the request, the charset/encoding of each text parameter that they are sending. There are even HTTP headers already defined for that. (Nowadays the default could be Unicode/UTF-8).

The problem is that browsers and other agents don't do that, although they undoubtedly always know themselves, and although it would solve a series of issues that have literally been there forever at the server and application level (*).

I have often wondered if/why the Apache Foundation does not carry enough influence over the HTTP/HTML specifications process and over browser producers to achieve that. (And if not the Apache Foundation, then who?)

(*) My own guess is that this basic thing (or lack of it) has cost over the years many thousands of lines of unnecessary code and many thousands of unproductive developer hours. As a tiny example, just consider the above web.xml parameters, and how much time in total was dedicated to their definition and implementation. Never mind all the previous related filters and valves and their discussions on this list. And that's only for Tomcat.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
Re: distinction between resource charset and format octet decoding
On 2/6/2020 10:44 AM, Mark Thomas wrote:

> …
> As of Tomcat 10, conf/web.xml contains the following:
>
> <request-character-encoding>UTF-8</request-character-encoding>
> <response-character-encoding>UTF-8</response-character-encoding>
>
> That *should* have the effect you are looking for but I confess I haven't tested it in any great detail.

Yes! Oh, that is so wonderful. Thank you!

I brought this issue up on the list over a year ago, and I have since published my entire comprehensive software development course (still being expanded).

https://www.globalmentor.com/courses/softdev/

The course is centered around Tomcat as the server, and the lesson on HTML forms contains a section warning to use ``.

https://www.globalmentor.com/courses/softdev/html-forms

Once Tomcat 10 is released I'll be able to update this note as well.

Thanks again!

Garret
Re: distinction between resource charset and format octet decoding
On 06/02/2020 13:39, Garret Wilson wrote:
> On 2/6/2020 10:36 AM, Mark Thomas wrote:
>> …
>>>> Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.
>>> Is this still on the list for discussion for Tomcat 10?
>> No, because it has already been implemented for Tomcat 10 and is in the milestone release currently being voted on.
>
> Waitasec. I'm not used to good news, so I want to make sure I understand what you're saying. Are you saying that the proposed Tomcat 10 implementation already interprets encoded octets in web form submissions using UTF-8 by default?!! :O

As of Tomcat 10, conf/web.xml contains the following:

<request-character-encoding>UTF-8</request-character-encoding>
<response-character-encoding>UTF-8</response-character-encoding>

That *should* have the effect you are looking for but I confess I haven't tested it in any great detail.

Mark

> It will be a joy to update the FAQ when this is released.
>
> Garret
Re: distinction between resource charset and format octet decoding
On 2/6/2020 10:36 AM, Mark Thomas wrote:

> …
>>> Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.
>> Is this still on the list for discussion for Tomcat 10?
> No, because it has already been implemented for Tomcat 10 and is in the milestone release currently being voted on.

Waitasec. I'm not used to good news, so I want to make sure I understand what you're saying. Are you saying that the proposed Tomcat 10 implementation already interprets encoded octets in web form submissions using UTF-8 by default?!! :O

It will be a joy to update the FAQ when this is released.

Garret
Re: distinction between resource charset and format octet decoding
On 06/02/2020 13:30, Garret Wilson wrote:
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> …
>>
>> Yes, this default is now very out-dated. That is a side-effect of:
>> …
>> As of Servlet 4.0 there is a specification compliant configuration option to change this default to any encoding of your choice. Obviously, UTF-8 is one of the options. You can do this by adding the following to your web.xml:
>> …
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.
>
> Is this still on the list for discussion for Tomcat 10?

No, because it has already been implemented for Tomcat 10 and is in the milestone release currently being voted on.

Mark

> In my opinion it would be a real shame if Tomcat 10 ships with a web form encoding default that goes against the WhatWG specifications and corrupts non ISO-8859-1 content under modern browsers.
>
> Garret
>
> P.S. Mark, please ignore the other email from my personal email address. Because the Tomcat users list doesn't include my name in the "To:" header, my email client didn't know to use the correct reply address.
Re: distinction between resource charset and format octet decoding
On 1/8/2019 9:57 PM, Mark Thomas wrote:

> …
>
> Yes, this default is now very out-dated. That is a side-effect of:
> …
> As of Servlet 4.0 there is a specification compliant configuration option to change this default to any encoding of your choice. Obviously, UTF-8 is one of the options. You can do this by adding the following to your web.xml:
> …
>
> Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.

Is this still on the list for discussion for Tomcat 10?

In my opinion it would be a real shame if Tomcat 10 ships with a web form encoding default that goes against the WhatWG specifications and corrupts non ISO-8859-1 content under modern browsers.

Garret

P.S. Mark, please ignore the other email from my personal email address. Because the Tomcat users list doesn't include my name in the "To:" header, my email client didn't know to use the correct reply address.
Re: distinction between resource charset and format octet decoding
Sorry to bring up the non-UTF-8 escaped octets form POST problem again, but …

On 1/8/2019 3:57 PM, Mark Thomas wrote:

> …
> As of Servlet 4.0 there is a specification compliant configuration option to change this default to any encoding of your choice. Obviously, UTF-8 is one of the options. You can do this by adding the following to your web.xml:
>
> <request-character-encoding>UTF-8</request-character-encoding>
>
> If you add it to conf/web.xml it applies to every web application deployed to Tomcat. Tomcat 9 uses this in the examples, manager and host-manager applications in place of the SetCharacterEncodingFilter.

As you know I've already updated the Tomcat FAQ with the options for forcing Tomcat to interpret form POSTs with any escaped characters using UTF-8 octet sequences (as modern browsers send, and as HTML5 requires) instead of ISO-8859-1 (as the Servlet 4 spec says).

But the problem is worse with the Spring community. If someone is using Spring Boot to create an executable JAR/WAR using embedded Tomcat, Spring Boot does something to configure Tomcat to interpret the POSTs correctly (that is, as the modern web likes it, not as the Servlet 4 spec says). Unfortunately, if I use Spring Boot to make a WAR which is both a self-contained executing WAR /and/ a WAR deployable on Tomcat, when I deploy the WAR on Tomcat the escaped octets are interpreted as ISO-8859-1, so my web app breaks. Yes, the WAR runs differently depending on whether it uses Spring Boot's embedded Tomcat or is deployed on standalone Tomcat.

Spring Boot ignores any `web.xml` file. I guess I could create a `web.xml` file only for standalone Tomcat, but then this freezes Eclipse (as I posted elsewhere) because Eclipse doesn't understand ``.

So like so many things on the web, this is a mess. This is a serious issue, in my opinion. The Servlet 4 specification is out of step with everything else in the ecosystem!

> Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.

Yes, can I just re-second (third?) that motion, and underscore the need for this to be changed in Tomcat 10?

Thanks,

Garret
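To illustrate the breakage described above, here is a minimal Java sketch of what happens when UTF-8 escaped octets are decoded as ISO-8859-1, together with a commonly used after-the-fact repair. The class and method names are hypothetical, and the sketch only imitates the decoding with `URLDecoder` (Java 10+); it does not call into Tomcat or Spring Boot. The repair relies on ISO-8859-1 mapping each octet to exactly one character, and it is only valid when the client really did send UTF-8.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class.
final class MojibakeSketch {

    /** Imitates a container that interprets %nn octets as ISO-8859-1. */
    static String decodeAsLatin1(String rawValue) {
        return URLDecoder.decode(rawValue, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original octets and reinterprets them as UTF-8. */
    static String repair(String misdecoded) {
        byte[] originalOctets = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalOctets, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String broken = decodeAsLatin1("caf%C3%A9"); // two characters where one was meant
        System.out.println(broken);
        System.out.println(repair(broken));
    }
}
```

Configuring the request character encoding once, as discussed in this thread, avoids having to apply a repair like this to every parameter.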
Re: distinction between resource charset and format octet decoding
On 01/02/2019 17:58, Garret Wilson wrote:
> OK, Mark, I've made my initial edits to the https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check them over!_ This is my first edit to the wiki.
>
> That page has a lot of legacy information, some of which had to do with internal Tomcat stuff, and some of which had to do with minute details of obsolete RFCs and evolution of browser behavior. I didn't want to spend the entire day (week?) on this, so I tried surgically to address only the sections relating to POST of application/x-www-form-urlencoded and how percent-encoded octets are interpreted. I couldn't resist updating the specification links and changing just a little prose about URL percent encoding.
>
> There is the risk now that other sections of the page are still outdated and conflict with my changes, but most importantly the FAQ should provide more complete information on how Tomcat web applications can be made to work with modern browsers.
>
> Please let me know if I bungled anything or if I need to clarify something.

LGTM.

> Thanks for letting me participate.

No need to thank us. We should be thanking you. Thank you.

So, what do you want to work on next? ;)

Cheers,

Mark

> Garret
>
> On 1/23/2019 12:26 AM, Mark Thomas wrote:
>> On 23/01/2019 05:07, Garret Wilson wrote:
>>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>>> …
>>>> Anything in PascalCase becomes a link to a wiki page of that name. Usernames are created in this form so references to the user automatically become links to that user's page in the wiki.
>>>
>>> Ah, OK, that explains it. Very good to know. Maybe a little semantic overloading, but as this is my first wiki account anywhere, I'm guessing it's typical with whatever software you're using.
>>>
>>> Anyway my account is created, with username `GarretWilson`. After I get permissions I'll update the info on octet encoding for application/x-www-form-urlencoded in relation to the servlet spec. It may not be immediately, but I'll slowly but surely get to it.
>> Karma granted. Happy editing.
>>
>> Cheers,
>>
>> Mark
Re: distinction between resource charset and format octet decoding
OK, Mark, I've made my initial edits to the https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check them over!_ This is my first edit to the wiki.

That page has a lot of legacy information, some of which had to do with internal Tomcat stuff, and some of which had to do with minute details of obsolete RFCs and evolution of browser behavior. I didn't want to spend the entire day (week?) on this, so I tried surgically to address only the sections relating to POST of application/x-www-form-urlencoded and how percent-encoded octets are interpreted. I couldn't resist updating the specification links and changing just a little prose about URL percent encoding.

There is the risk now that other sections of the page are still outdated and conflict with my changes, but most importantly the FAQ should provide more complete information on how Tomcat web applications can be made to work with modern browsers.

Please let me know if I bungled anything or if I need to clarify something.

Thanks for letting me participate.

Garret

On 1/23/2019 12:26 AM, Mark Thomas wrote:
> On 23/01/2019 05:07, Garret Wilson wrote:
>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>> …
>>> Anything in PascalCase becomes a link to a wiki page of that name. Usernames are created in this form so references to the user automatically become links to that user's page in the wiki.
>>
>> Ah, OK, that explains it. Very good to know. Maybe a little semantic overloading, but as this is my first wiki account anywhere, I'm guessing it's typical with whatever software you're using.
>>
>> Anyway my account is created, with username `GarretWilson`. After I get permissions I'll update the info on octet encoding for application/x-www-form-urlencoded in relation to the servlet spec. It may not be immediately, but I'll slowly but surely get to it.
> Karma granted. Happy editing.
>
> Cheers,
>
> Mark
Re: distinction between resource charset and format octet decoding
On 2/1/2019 9:38 AM, Christopher Schultz wrote:

>> Amazing. A close reading of RFC 3986 reveals that there is no clear mandate for UTF-8 in existing URI schemes, even though recommended for new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat included), so I'll try to indicate that.
>
> Wait... are you saying that _it's the Wild West out there?_ ;)
>
> Yes. The web is indeed held together with duct tape and baling wire. It's amazing that it works as well as it does.

Hahaha. I'm /so/ happy someone agrees with me! Here's to improving things with a little JB Weld once in a while. (That's what my grandparents used on the farm when the baling wire and duct tape couldn't handle it.)

Garret
Re: distinction between resource charset and format octet decoding
Garret,

On 2/1/19 11:08, Garret Wilson wrote:
> On 2/1/2019 7:23 AM, Garret Wilson wrote:
>> …
>> * "There /is no default encoding for URIs/ specified anywhere, which is why there is a lot of confusion when it comes to decoding these values." Sheesh, this is ancient. I'll correct it as per https://tools.ietf.org/html/rfc3986#section-2.5 .
>
> Amazing. A close reading of RFC 3986 reveals that there is no clear mandate for UTF-8 in existing URI schemes, even though recommended for new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat included), so I'll try to indicate that.

Wait... are you saying that _it's the Wild West out there?_ ;)

Yes. The web is indeed held together with duct tape and baling wire. It's amazing that it works as well as it does.

-chris
Re: distinction between resource charset and format octet decoding
On 2/1/2019 7:23 AM, Garret Wilson wrote:

> …
> * "There /is no default encoding for URIs/ specified anywhere, which is why there is a lot of confusion when it comes to decoding these values." Sheesh, this is ancient. I'll correct it as per https://tools.ietf.org/html/rfc3986#section-2.5 .

Amazing. A close reading of RFC 3986 reveals that there is no clear mandate for UTF-8 in existing URI schemes, even though recommended for new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat included), so I'll try to indicate that.

Garret
Re: distinction between resource charset and format octet decoding
Good morning,

I'm just getting to the editing. I'm going to list some thoughts I have as I go through this, so you can verify things:

* The servlet spec links are way out of date. I'll update them.
* "There /is no default encoding for URIs/ specified anywhere, which is why there is a lot of confusion when it comes to decoding these values." Sheesh, this is ancient. I'll correct it as per https://tools.ietf.org/html/rfc3986#section-2.5 .
* "Most of the web uses ISO-8859-1 as the default for query strings." Is this still true?! In light of the above, I would think it is not true, but I wanted to ask, as you know better about what you've seen "in the wild".

Garret
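As one data point on query-string decoding, the JDK's own `java.net.URI` class is documented to decode escaped octets as UTF-8 when the decoded accessors are used. The following sketch (class and method names are hypothetical, and `example.com` is a placeholder host) contrasts the raw and decoded query of the same URI. It says nothing about what any particular server or browser does.

```java
import java.net.URI;

// Hypothetical demonstration class.
final class UriQuerySketch {

    /** The query exactly as written, with percent escapes left untouched. */
    static String rawQuery(String uri) {
        return URI.create(uri).getRawQuery();
    }

    /** The query with escaped octets decoded; java.net.URI uses UTF-8 for this. */
    static String decodedQuery(String uri) {
        return URI.create(uri).getQuery();
    }

    public static void main(String[] args) {
        String utf8Escaped = "http://example.com/search?q=caf%C3%A9";
        System.out.println(rawQuery(utf8Escaped));
        System.out.println(decodedQuery(utf8Escaped));

        // An ISO-8859-1 style escape is not valid UTF-8 on its own, so the
        // decoded form does not contain the intended character.
        String latin1Escaped = "http://example.com/search?q=caf%E9";
        System.out.println(decodedQuery(latin1Escaped));
    }
}
```

Code that needs to apply a different charset would start from `getRawQuery()` and decode the escapes itself.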
Re: distinction between resource charset and format octet decoding
On 23/01/2019 05:07, Garret Wilson wrote:
> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>> …
>> Anything in PascalCase becomes a link to a wiki page of that name. Usernames are created in this form so references to the user automatically become links to that user's page in the wiki.
>
> Ah, OK, that explains it. Very good to know. Maybe a little semantic overloading, but as this is my first wiki account anywhere, I'm guessing it's typical with whatever software you're using.
>
> Anyway my account is created, with username `GarretWilson`. After I get permissions I'll update the info on octet encoding for application/x-www-form-urlencoded in relation to the servlet spec. It may not be immediately, but I'll slowly but surely get to it.

Karma granted. Happy editing.

Cheers,

Mark
Re: distinction between resource charset and format octet decoding
On 1/15/2019 3:20 AM, Mark Thomas wrote:

> …
> Anything in PascalCase becomes a link to a wiki page of that name. Usernames are created in this form so references to the user automatically become links to that user's page in the wiki.

Ah, OK, that explains it. Very good to know. Maybe a little semantic overloading, but as this is my first wiki account anywhere, I'm guessing it's typical with whatever software you're using.

Anyway my account is created, with username `GarretWilson`. After I get permissions I'll update the info on octet encoding for application/x-www-form-urlencoded in relation to the servlet spec. It may not be immediately, but I'll slowly but surely get to it.

Cheers,

Garret
Re: distinction between resource charset and format octet decoding
On 15/01/2019 03:39, Garret Wilson wrote:
> On 1/9/2019 2:30 AM, Mark Thomas wrote:
>> …
>> Create yourself an account at https://wiki.apache.org/tomcat (click login then create an account) and let the list know your ID. Then one of the admins can add you to the allowed editors.
>
> I was just ready to create an account, but I want to verify the details so I don't screw things up.
>
> * It asks for a "Name". Is this a username, I suppose? So we don't maintain our "name" separate from our "login username"?

Yes, it is your username. Any linkage from that to your "public name" would be maintained on your user page - if you wish.

> * It says to use "FirstnameLastName". Are you literally wanting us to use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as one who works with protocols all the time, I automatically assume this stuff is important. But I prefer to use lowercase on my usernames; I'm a little confused about why this would want PascalCase for a login username. (I can't think of another system that I use that requires PascalCase usernames.)

Think of it as a SHOULD rather than a MUST.

> My guess is that it's trying to maintain a "human name" and a "username" but combine them both into one field or something. I can't say this approach is typical…

Anything in PascalCase becomes a link to a wiki page of that name. Usernames are created in this form so references to the user automatically become links to that user's page in the wiki.

It isn't a feature we use much at the moment. A quick check shows that most, but not all, contributors have created their user name in PascalCase.

For example, take a look at https://wiki.apache.org/tomcat/AndrewCarr

Mark
Re: distinction between resource charset and format octet decoding
On 1/9/2019 2:30 AM, Mark Thomas wrote:
> …
> Create yourself an account at https://wiki.apache.org/tomcat (click
> login then create an account) and let the list know your ID. Then one of
> the admins can add you to the allowed editors.

I was just ready to create an account, but I want to verify the details so I don't screw things up.

* It asks for a "Name". Is this a username, I suppose? So we don't maintain our "name" separate from our "login username"?
* It says to use "FirstnameLastName". Are you literally wanting us to use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as one who works with protocols all the time, I automatically assume this stuff is important. But I prefer to use lowercase on my usernames; I'm a little confused about why this would want PascalCase for a login username. (I can't think of another system that I use that requires PascalCase usernames.)

My guess is that it's trying to maintain a "human name" and a "username" but combine them both into one field or something. I can't say this approach is typical…

Garret
Re: distinction between resource charset and format octet decoding
On 09/01/2019 00:50, Garret Wilson wrote:
> Hi, Mark, and thanks for the quick response. You provided some info I
> wasn't aware of. Some responses below:
>
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> On 08/01/2019 21:31, Garret Wilson wrote:
>>
>>> But as discussed above, this is completely wrong: the resource
>>> character encoding of a request sent in
>>> `application/x-www-form-urlencoded` should have absolutely no bearing
>>> on how the encoded octets within that resource are decoded.
>>
>> That is not the correct interpretation of section 3.12 of the Servlet
>> 4.0 specification (note the section numbers do vary between spec
>> versions). Tomcat implements the correct interpretation - i.e. the
>> charset from the request content-type defines how encoded octets are
>> decoded and, if none is specified, ISO-8859-1 is used as the default.
>
> Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat
> is correctly following the spec, but I would still say the servlet spec
> is wrong to make any linkage at all between resource encoding and %nn
> interpretation. In fact reading the prose it's not clear to me that the
> servlet spec is even strongly tying the %nn interpretation to the
> encoding. It just seems to say that, unless otherwise specified, the %nn
> interpretation should be ISO-8859-1. And actually that's a step up from
> the HTML 4.01 spec, which in
> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates
> that they should be interpreted as US-ASCII codes. :(
>
> You indicate that this is all out of date, and I think we're in
> agreement there. We really, really need to get the next servlet
> specification to remove this part. In fact the servlet specification
> should defer to the official `application/x-www-form-urlencoded`
> specification, which at this point I think is the W3C HTML5 spec, which
> in turn defers to the WHATWG spec (which clearly says that UTF-8 should
> be used). What makes all of this more of a mess is that there seems to
> be no way to work around this from the client side, e.g. by putting
> something in the HTML to indicate UTF-8, as
> `application/x-www-form-urlencoded` doesn't support a `charset` parameter.
>
> Anyway if there are any openings on the committee to update the servlet
> spec, let me know.

That has moved to Eclipse. The process to update the spec is still being defined. The Jakarta EE Servlet API project is the project to get involved in.

>> ...
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>>
>> <request-character-encoding>UTF-8</request-character-encoding>
>
> Oh, that is really good to know, thanks!! Still I say that the request
> character encoding is orthogonal to the %nn encoding, but, still, it's
> good to have an implementation-agnostic way to do it.
>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>
> Yes please! If I can help in any way, let me know.
>
>> The Tomcat Wiki also needs to be updated to take account of this new
>> configuration option (and the related <response-character-encoding>).
>> Since it is a wiki and this is clearly an issue you care about would
>> you like to tackle that?
>
> Yes, I'd love to. Let me know what permissions I need, etc.

Create yourself an account at https://wiki.apache.org/tomcat (click login then create an account) and let the list know your ID. Then one of the admins can add you to the allowed editors.

Apologies for the hoop jumping required but without the manual approval step for new accounts, the ASF project wikis were being deluged with spam.

Mark

> I have an international flight boarding right now so I have to go, and I
> may not reply for the next few hours, but definitely sign me up.
>
> Thanks,
>
> Garret
Re: distinction between resource charset and format octet decoding
Hi, Mark, and thanks for the quick response. You provided some info I wasn't aware of. Some responses below:

On 1/8/2019 9:57 PM, Mark Thomas wrote:
> On 08/01/2019 21:31, Garret Wilson wrote:
>
>> But as discussed above, this is completely wrong: the resource
>> character encoding of a request sent in
>> `application/x-www-form-urlencoded` should have absolutely no bearing
>> on how the encoded octets within that resource are decoded.
>
> That is not the correct interpretation of section 3.12 of the Servlet
> 4.0 specification (note the section numbers do vary between spec
> versions). Tomcat implements the correct interpretation - i.e. the
> charset from the request content-type defines how encoded octets are
> decoded and, if none is specified, ISO-8859-1 is used as the default.

Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat is correctly following the spec, but I would still say the servlet spec is wrong to make any linkage at all between resource encoding and %nn interpretation. In fact reading the prose it's not clear to me that the servlet spec is even strongly tying the %nn interpretation to the encoding. It just seems to say that, unless otherwise specified, the %nn interpretation should be ISO-8859-1. And actually that's a step up from the HTML 4.01 spec, which in https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates that they should be interpreted as US-ASCII codes. :(

You indicate that this is all out of date, and I think we're in agreement there. We really, really need to get the next servlet specification to remove this part. In fact the servlet specification should defer to the official `application/x-www-form-urlencoded` specification, which at this point I think is the W3C HTML5 spec, which in turn defers to the WHATWG spec (which clearly says that UTF-8 should be used). What makes all of this more of a mess is that there seems to be no way to work around this from the client side, e.g. by putting something in the HTML to indicate UTF-8, as `application/x-www-form-urlencoded` doesn't support a `charset` parameter.

Anyway if there are any openings on the committee to update the servlet spec, let me know.

> ...
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
>
> <request-character-encoding>UTF-8</request-character-encoding>

Oh, that is really good to know, thanks!! Still I say that the request character encoding is orthogonal to the %nn encoding, but, still, it's good to have an implementation-agnostic way to do it.

> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.

Yes please! If I can help in any way, let me know.

> The Tomcat Wiki also needs to be updated to take account of this new
> configuration option (and the related <response-character-encoding>).
> Since it is a wiki and this is clearly an issue you care about would
> you like to tackle that?

Yes, I'd love to. Let me know what permissions I need, etc.

I have an international flight boarding right now so I have to go, and I may not reply for the next few hours, but definitely sign me up.

Thanks,

Garret
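The claim in the message above that the request character encoding is orthogonal to the %nn encoding can be illustrated with a small sketch. It is only an illustration of that argument, not of what the servlet spec prescribes; the class and method names (`TwoLayerDemo`, `wireBytes`, `decodeOctets`) are made up for this example, and `URLDecoder.decode(String, Charset)` needs Java 10 or later. Because a urlencoded body contains only ASCII characters, its bytes on the wire are the same under US-ASCII, ISO-8859-1, and UTF-8, so the choice of charset for the %nn octets is a separate decision.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TwoLayerDemo {

    /** Layer 1: the resource charset turns the characters of the body into bytes on the wire. */
    static byte[] wireBytes(String body, Charset resourceCharset) {
        return body.getBytes(resourceCharset);
    }

    /** Layer 2: the %nn octets are decoded with their own charset, chosen independently of layer 1. */
    static String decodeOctets(byte[] wire, Charset resourceCharset, Charset octetCharset) {
        String body = new String(wire, resourceCharset);
        return URLDecoder.decode(body, octetCharset);
    }

    public static void main(String[] args) {
        String body = "fullName=Fl%C3%A1vio+Jos%C3%A9";
        byte[] asLatin1 = wireBytes(body, StandardCharsets.ISO_8859_1);
        byte[] asUtf8 = wireBytes(body, StandardCharsets.UTF_8);
        // The body is pure ASCII, so both resource charsets produce identical bytes.
        System.out.println(Arrays.equals(asLatin1, asUtf8));
        // Decoding the octets as UTF-8 gives the same text whichever resource charset was used.
        System.out.println(decodeOctets(asLatin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)
                .equals(decodeOctets(asUtf8, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
    }
}
```
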
Re: distinction between resource charset and format octet decoding
On 08/01/2019 21:31, Garret Wilson wrote:

> But as discussed above, this is completely wrong: the resource
> character encoding of a request sent in
> `application/x-www-form-urlencoded` should have absolutely no bearing
> on how the encoded octets within that resource are decoded.

That is not the correct interpretation of section 3.12 of the Servlet 4.0 specification (note the section numbers do vary between spec versions). Tomcat implements the correct interpretation - i.e. the charset from the request content-type defines how encoded octets are decoded and, if none is specified, ISO-8859-1 is used as the default.

Yes, this default is now very outdated. That is a side-effect of:

- how long the Servlet specification has been around
- the very conservative approach taken by Java EE in terms of backwards compatibility (once set, defaults are very rarely - if ever - changed)
- arguably missed opportunities to address this issue prior to Servlet 4.0

As of Servlet 4.0 there is a specification compliant configuration option to change this default to any encoding of your choice. Obviously, UTF-8 is one of the options. You can do this by adding the following to your web.xml:

    <request-character-encoding>UTF-8</request-character-encoding>

If you add it to conf/web.xml it applies to every web application deployed to Tomcat. Tomcat 9 uses this in the examples, manager and host-manager applications in place of the SetCharacterEncodingFilter.

Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.

The Tomcat Wiki also needs to be updated to take account of this new configuration option (and the related <response-character-encoding>). Since it is a wiki and this is clearly an issue you care about would you like to tackle that?

Mark
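The rule described in the message above (use the charset from the request content-type, otherwise fall back to a default) can be sketched in plain Java. This is a simplified illustration under stated assumptions and is not Tomcat's actual code: the class `RequestCharsetSketch` and its methods are hypothetical, the header parsing is deliberately naive, and `URLDecoder.decode(String, Charset)` requires Java 10 or later. The `containerDefault` argument stands in for ISO-8859-1, or for whatever the web.xml setting overrides it with.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.util.Locale;

final class RequestCharsetSketch {

    /** Returns the charset parameter of a Content-Type value, or the fallback if there is none. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type itself; parameters start at index 1.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim().replace("\"", "");
                    // Throws an unchecked exception if the charset name is illegal or unsupported.
                    return Charset.forName(name);
                }
            }
        }
        return fallback;
    }

    /** Decodes one urlencoded parameter value using the request charset or the container default. */
    static String decodeParameter(String encoded, String contentType, Charset containerDefault) {
        return URLDecoder.decode(encoded, charsetOf(contentType, containerDefault));
    }
}
```

With `containerDefault` set to ISO-8859-1 and no `charset` parameter on the request, `Jos%C3%A9` comes out as the two-character mojibake form of the final letter; passing UTF-8 as the default, which is roughly what the web.xml option achieves, yields `José`.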
distinction between resource charset and format octet decoding
I have a question (using Tomcat 9.0.12 on Windows 10), and I'd like someone on the Tomcat development team to clarify a distinction for me regarding resource charsets and octet decoding in a particular format. I am not a newbie, and although the answer to my question may seem obvious, it goes to a critical issue that I believe to be a fundamental bug in Tomcat encoding processing.

Let's say that as an HTTP client I retrieve a resource `readme.txt` from Tomcat, and Tomcat clearly indicates via the HTTP response headers that the `Content-Type` is `text/plain; charset=ISO-8859-1`. That file contains, among other things, a line that says:

    See https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9 for more info.

I want to parse the text file and present a live link to the user (email clients do this all the time), but I want to make the link "pretty" by decoding the URL. The question is: do I decode the octets using UTF-8, to show `…fullName=Flávio+José`, or do I use ISO-8859-1 to decode the octets, so that I show `…fullName=FlÃ¡vio+JosÃ©`? (Flávio José is a famous Brazilian forró singer and musician, by the way.)

The content type encoding of `readme.txt` is ISO-8859-1, so I must use ISO-8859-1 to decode the octets in `Fl%C3%A1vio+Jos%C3%A9`, yielding `…fullName=FlÃ¡vio+JosÃ©`, right??!

No, of course not. The decoding of the octet sequence is independent of the resource encoding, and represents a separate layer of encoding _on top_ of the resource encoding. It wouldn't matter whether the text file were encoded in UTF-8, ISO-8859-1, or US-ASCII: the URL would still be https://example.com/example.jsp?fullName=Fl%C3%A1vio+Jos%C3%A9, and its octets should still be decoded using UTF-8 as per RFC 3986.

I'll get right to the point; the above was a rhetorical question used as an analogy. The Tomcat FAQ at https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q9 indicates that the default encoding for an HTTP POST is ISO-8859-1. That is true.
However, Tomcat then goes further and assumes that it should decode _the octets of `application/x-www-form-urlencoded`_ using ISO-8859-1 as well! This is simply wrong; the octets should be interpreted as a sequence of UTF-8 octets; see https://url.spec.whatwg.org/#concept-urlencoded-serializer . This means if my browser sends a POST with content `fullName=Fl%C3%A1vio+Jos%C3%A9` using `application/x-www-form-urlencoded`, Tomcat will interpret this request parameter as `FlÃ¡vio JosÃ©` in my servlet/JSP, when it should interpret it as `Flávio José`. (Tomcat correctly decodes the octets when used as a query parameter rather than a POST parameter.)

Now it may be that the FAQ is simply out of date; it still seems to think that encoded URI octets should not be interpreted as UTF-8, completely ignoring RFC 3986. If so, it is long out of date; RFC 3986 came out in 2005. (And indeed, Tomcat works with UTF-8 octets in URIs.)

But out of date or not, the FAQ at https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 then recommends that to force Tomcat to interpret the `application/x-www-form-urlencoded` octets as UTF-8, I must set the `org.apache.catalina.filters.SetCharacterEncodingFilter` filter (in some `web.xml` file) to `UTF-8`. (I can alternatively put `<% request.setCharacterEncoding("UTF-8"); %>` in my JSP.) And sure enough, it fixes the problem.

But as discussed above, this is completely wrong: the resource character encoding of a request sent in `application/x-www-form-urlencoded` should have absolutely no bearing on how the encoded octets within that resource are decoded. They must be decoded as UTF-8, irrespective of what "character encoding" Tomcat assumes the content to be.

Tomcat has updated the way it decodes URIs to support UTF-8; it is time Tomcat does the same for `application/x-www-form-urlencoded` values. The current approach is broken in the context of the modern web, and the workaround is simply wrong.
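The two outcomes described above can be reproduced outside a servlet container with `java.net.URLDecoder`. This is a minimal sketch, not Tomcat's decoding path: the class name `FormValueDecodingDemo` and its `decode` helper are invented for the example, and the `Charset` overload of `URLDecoder.decode` assumes Java 10 or later. Decoding with UTF-8 gives `Flávio José`; decoding the same octets with ISO-8859-1 gives `FlÃ¡vio JosÃ©`. How the printed characters appear also depends on the console's own encoding.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FormValueDecodingDemo {

    /** Decodes a urlencoded value, treating the %nn octets as bytes of the given charset. */
    static String decode(String encoded, Charset octetCharset) {
        // URLDecoder also turns '+' into a space, as the form encoding requires.
        return URLDecoder.decode(encoded, octetCharset);
    }

    public static void main(String[] args) {
        String posted = "Fl%C3%A1vio+Jos%C3%A9";
        // Octets C3 A1 and C3 A9 read as UTF-8: one accented letter each.
        System.out.println(decode(posted, StandardCharsets.UTF_8));
        // The same octets read as ISO-8859-1: two characters per octet pair (mojibake).
        System.out.println(decode(posted, StandardCharsets.ISO_8859_1));
    }
}
```
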
I also raised this at https://stackoverflow.com/q/54094982/421049 . I would have filed a Tomcat Bugzilla issue, but the bug report form indicated I should report the problem on this list first.

Thank you in advance for your attention to this matter.

Garret Wilson
GlobalMentor, Inc.