Re: POST request encoding - Tomcat/JVM configuration?
Jan,

On 10/24/2009 5:58 AM, Pfeifer Jan wrote:
> String decoded = new String(param.getBytes("iso-8859-1"),"UTF-8");

(I'm all out of breath from replacing those &quot; escapes with symbols... I need to get more exercise).

The above line of code is only valid if:

1. The bytes coming from the client were supposed to be UTF-8
2. Your server has been configured to interpret the data coming from clients unconditionally as ISO-8859-1
3. The characters you are trying to decode are in the ASCII character set

Why the third constraint? Because, if the client sends UTF-8 and the server decodes that as ISO-8859-1, information is lost in the translation... the bytes are not going to be magically re-combined into UTF-8 bytes when you call getBytes("ISO-8859-1") on them. It's only going to make things worse.

The only time transcoding bytes is appropriate is when you are decoding GET parameters, because any POST parameters ought to have been sent with a correct Content-Type (including a charset) parameter. It would be better to install a filter to set the character encoding of the request /before/ any data has been read from it if you were worried about the client sending an incorrect content type.

As for GET parameters, you're pretty much screwed as Andre points out: there's just no standard for URL encoding (okay, yes, there is a standard: use URL/%-encoded ISO-8859-1, unless the browser is modern and uses UTF-8 instead of ISO-8859-1 as its default URL encoding). It's just a mess.

> for a start, I know about URIEncoding in server.xml and about using
> Encoding filter, but we use this for decoding GET request for
> historical reasons. Or is there more "correct" way to decode String?

If you always want your strings decoded as UTF-8, then set URIEncoding="UTF-8" on your Connector and be done with it. Don't have your webapp's code re-coding strings that come from clients.

Again, read the CharacterEncoding page on the Wiki, as previously suggested. All will become clear. Well, the solution becomes clear, at least.

-chris

To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
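A small sketch of why re-coding inside the webapp is fragile, using the value from the original post ("čížek", written here with \u escapes so the source stays ASCII). It assumes Java 7 or later for StandardCharsets; the class and method names (RecodeDemo, recode, misdecode) are made up for illustration. The same line repairs the value when the container decoded UTF-8 bytes as ISO-8859-1, and damages it when the container had already decoded the bytes correctly, which corresponds to the two outputs reported in the original post.

```java
import java.nio.charset.StandardCharsets;

final class RecodeDemo {
    // The idiom from the thread, written with Charset constants instead of names.
    static String recode(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Simulates a container that decodes UTF-8 request bytes as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String sent = "\u010D\u00ED\u017Eek"; // the test word from the original post

        // Case A: the container decoded the UTF-8 bytes as ISO-8859-1.
        String caseA = misdecode(sent);
        // Case B: the container already decoded the bytes as UTF-8.
        String caseB = sent;

        System.out.println("case A repaired: " + recode(caseA).equals(sent));
        System.out.println("case B intact:   " + recode(caseB).equals(sent));
    }
}
```

Because the application cannot tell the two cases apart from the String alone, fixing the decoding at the container level (URIEncoding, or a filter that sets the request encoding) is the more robust choice.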
Re: POST request encoding - Tomcat/JVM configuration?
I apologize for my last post, I accidentally double-encoded it. Original post:

"That JSP should work on any clean Tomcat installation. "It doesn't work" isn't very informative. We need details."

There really is not much more to say: "it worked, now it does not". I also use the MyEclipse IDE. I disabled it, reinstalled Tomcat, tried it, and "it does not work". I believe that a clean installation of the whole system would work, but this is not an option at the moment. Neither is a Tomcat upgrade. What else can change the way Tomcat/Java treats the request body?

"String decoded = new String(param.getBytes("iso-8859-1"),"UTF-8"); for a start"

I know about URIEncoding in server.xml and about using an encoding filter, but we use this for decoding GET requests for historical reasons. Or is there a more "correct" way to decode the String?

Jan
Re: POST request encoding - Tomcat/JVM configuration?
Pfeifer Jan wrote:
> ...
> I know about URIEncoding in server.xml and about using an encoding filter, but we use this for decoding GET requests for historical reasons. Or is there a more "correct" way to decode the String?

Jan,

this whole area of the character set in which HTTP requests come into a server, and are decoded by the server, is complicated, confusing, and generally not well-defined (or defined in contradictory ways) by the Internet RFCs themselves.

In short, there can be many reasons why you are not getting the data in the character set that you expect, and finding the specific reason that applies in your case can be tedious and involve several levels. To resolve it, you have to be very systematic, and check every step one by one.

Here are some principles:

1) The general default for the HTTP protocol, and for HTML, is iso-8859-1. Anything else, you have to explicitly specify. iso-8859-1 is at the same time a character set and an encoding, in which each character is represented by one byte.

2) Internally, Java represents all character strings as Unicode (which is a character set), using a 16-bit representation for each character (which is an encoding).

(1) and (2) above mean that somewhere, no matter what, some character set translation is going to take place between the web and your Java webapp, and vice-versa between your webapp and the web. The trick is to get the pieces in place so that the /correct/ translations take place in each direction.

3) iso-8859-1 (in fact all iso-8859-x character sets and encodings) can each represent only 256 different characters, which is not enough to cover all languages used on the WWW nowadays. So if your applications have to use Czech and German at the same time, you should not use an iso-8859 charset.

4) UTF-8 is a popular encoding of Unicode, where each character is represented by one or more bytes. The big advantage of Unicode/UTF-8 is that it can represent all characters of all languages used on the WWW.
The drawback of Unicode/UTF-8 at the moment is that, for historical reasons, it is /not/ the HTTP/HTML default charset, so you have to explicitly specify it in several places.

5) Despite what is said above about the default for HTTP being iso-8859-1, URLs are an exception. A URL, by definition, is not in any specific character set or encoding. The definition of URLs just says that, whatever the character set and encoding used, any byte whose value does not match one of the printable characters of the US-ASCII range (roughly [0-9A-Za-z] + some) must be encoded in %AB notation, where %AB is the % sign followed by a 2-digit hexadecimal representation of the byte value.

In other words, when interpreting data that comes as part of a URL (like the query string in an HTTP GET):
- the server first decodes the URI from the %AB encoding above, back into a series of bytes
- then the server further decodes this series of bytes into a string of characters, using some charset encoding
- but the only way to know in which character set the data really is, is *by convention* between the client and the server.

The convention, historically so far, has always been iso-8859-1. Recently and slowly, it seems that this convention is shifting toward UTF-8. But note that it is still a convention, and that in order to make sure that your application (and Tomcat before it) can consider the parameters from a GET URL to be UTF-8, /you/ have to make sure that all URLs on which a user may click in one of /your/ pages are indeed encoded that way. (And thus, basically, if you receive a request from an unknown source, you have to guess.)

See in the Tomcat 6.0 docs the following attribute of the HTTP Connector:

URIEncoding: This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.
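The two decoding steps described above can be sketched in plain Java. This is only an illustration, not how Tomcat implements it: the class UrlDecodeSteps and its methods are hypothetical names, the code assumes Java 7 or later, and the sample value is the UTF-8, %-encoded form of "čížek" from the original post. The same bytes are turned into characters twice, once under each convention.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UrlDecodeSteps {
    // Step 1: undo the %AB escaping. The result is raw bytes, not yet characters.
    static byte[] percentDecode(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%') {
                if (i + 2 >= escaped.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                // Two hex digits give one byte value.
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c); // unescaped characters are assumed to be US-ASCII
            }
        }
        return out.toByteArray();
    }

    // Step 2: interpret the bytes as characters, using the agreed-upon charset.
    static String decode(String escaped, Charset charset) {
        return new String(percentDecode(escaped), charset);
    }

    public static void main(String[] args) {
        String query = "%C4%8D%C3%AD%C5%BEek";
        String asUtf8 = decode(query, StandardCharsets.UTF_8);
        String asLatin1 = decode(query, StandardCharsets.ISO_8859_1);
        // Same bytes, different conventions, different strings.
        System.out.println("UTF-8 length:      " + asUtf8.length());
        System.out.println("ISO-8859-1 length: " + asLatin1.length());
    }
}
```

Step 1 is unambiguous; step 2 is where the client and server have to agree, which is what the URIEncoding attribute configures on the server side.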
(The above applies to GET requests, because in that case the request parameters are passed as part of the URI.)

Now about POST requests:

In a POST, the request parameters are not sent as part of a query string in a URI; they are sent in the *body* of the request. There are 2 ways to format a POST request from the client side:
a) as a url-encoded body (the default)
b) as a multipart/form-data body (that is the case if the <form> tag contains the attribute enctype="multipart/form-data")

In (a), the body consists of one long string, which looks like the query string of a GET:

param1=value1&param2=value2&...&paramn=valuen

The charset and encoding of that string are supposed to be given by the Content-type HTTP header of that POST request.

In (b), it is more complicated: the body of the request is composed of parts, each part representing one parameter. Each part /should/ have its own Content-type header, indicating the type of that part and, if applicable, the character set and encoding of that part.
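For case (a), a minimal sketch of how a url-encoded body could be parsed with the charset taken from the Content-Type header. This is not Tomcat's code: FormBodyDemo, charsetOf and parse are hypothetical names, and the fallback to ISO-8859-1 when no charset parameter is present is an assumption based on the default described above. Only java.net.URLDecoder.decode(String, String) from the standard library is used for the actual decoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class FormBodyDemo {
    // Extracts the charset parameter from a Content-Type value,
    // falling back to ISO-8859-1 (assumed default) when it is absent.
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String value = p.substring("charset=".length()).trim();
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1); // strip optional quotes
                    }
                    return value;
                }
            }
        }
        return "ISO-8859-1";
    }

    // Splits "name=value&name=value" and decodes each piece with the header's charset.
    static Map<String, String> parse(String body, String contentType) {
        String charset = charsetOf(contentType);
        Map<String, String> params = new LinkedHashMap<String, String>();
        try {
            for (String pair : body.split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                String name = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                params.put(URLDecoder.decode(name, charset), URLDecoder.decode(value, charset));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
        return params;
    }

    public static void main(String[] args) {
        String body = "param=%C4%8D%C3%AD%C5%BEek&other=1";
        Map<String, String> declared =
                parse(body, "application/x-www-form-urlencoded; charset=UTF-8");
        Map<String, String> undeclared =
                parse(body, "application/x-www-form-urlencoded");
        System.out.println("declared length:   " + declared.get("param").length());
        System.out.println("undeclared length: " + undeclared.get("param").length());
    }
}
```

In a servlet, calling request.setCharacterEncoding("UTF-8") before the first parameter is read plays the role of the declared charset when the client did not send one.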
Re: POST request encoding - Tomcat/JVM configuration?
Thank you for the reply. I've checked that link several times already. I have no problem with the code itself. It works. Or rather, it now works everywhere except on my computer. I am looking for the reason why, after almost a year, it stopped working properly, and for the configuration file I may have missed. I have tested many samples with no luck so far.

Jan
Re: POST request encoding - Tomcat/JVM configuration?
Pfeifer Jan wrote:
> Thank you for the reply. I've checked that link several times already. I have no problem with the code itself. It works.

I disagree. I can see a whole bunch of things wrong with that code.

Does the sample JSP in the FAQ work? If not, you have some system config issues to fix. If it does, then you just need to fix your code.

Mark
Re: Re: POST request encoding - Tomcat/JVM configuration?
"That JSP should work on any clean Tomcat installation. "It doesn't work" isn't very informative. We need details."

There really is not much more to say: "it worked, now it does not". I also use the MyEclipse IDE. I disabled it, reinstalled Tomcat, tried it, and "it does not work". I believe that a clean installation of the whole system would work, but this is not an option at the moment. Neither is a Tomcat upgrade. What else can change the way Tomcat/Java treats the request body?

"String decoded = new String(param.getBytes("iso-8859-1"),"UTF-8"); for a start"

I know about URIEncoding in server.xml and about using an encoding filter, but we use this for decoding GET requests for historical reasons. Or is there a more "correct" way to decode the String?

Jan
Re: POST request encoding - Tomcat/JVM configuration?
Pfeifer Jan wrote:
> "That JSP should work on any clean Tomcat installation. "It doesn't work" isn't very informative. We need details."
>
> There really is not much more to say: "it worked, now it does not". I also use the MyEclipse IDE. I disabled it, reinstalled Tomcat, tried it, and "it does not work". I believe that a clean installation of the whole system would work, but this is not an option at the moment. Neither is a Tomcat upgrade. What else can change the way Tomcat/Java treats the request body?
>
> "String decoded = new String(param.getBytes("iso-8859-1"),"UTF-8"); for a start"
>
> I know about URIEncoding in server.xml and about using an encoding filter, but we use this for decoding GET requests for historical reasons. Or is there a more "correct" way to decode the String?
>
> Jan

Jan,

I don't know if this affects only my mail reader, but your messages to the list, for me, are almost impossible to read because of the apparent profusion of html escapes in them. Can you maybe make sure that you are posting only in plain text?

Until I am sure that it is not only a problem on my side, I will refrain from further comments about posting encoding-related issues in html...
Re: POST request encoding - Tomcat/JVM configuration?
On 23/10/2009 15:21, André Warnier wrote:
> Pfeifer Jan wrote:
>> "That JSP should work on any clean Tomcat installation. "It doesn't work" isn't very informative. We need details."
>>
>> There really is not much more to say: "it worked, now it does not". I also use the MyEclipse IDE. I disabled it, reinstalled Tomcat, tried it, and "it does not work". I believe that a clean installation of the whole system would work, but this is not an option at the moment. Neither is a Tomcat upgrade. What else can change the way Tomcat/Java treats the request body?
>>
>> "String decoded = new String(param.getBytes("iso-8859-1"),"UTF-8"); for a start"
>>
>> I know about URIEncoding in server.xml and about using an encoding filter, but we use this for decoding GET requests for historical reasons. Or is there a more "correct" way to decode the String?
>>
>> Jan
>
> Jan,
>
> I don't know if this affects only my mail reader, but your messages to the list, for me, are almost impossible to read because of the apparent profusion of html escapes in them. Can you maybe make sure that you are posting only in plain text?
>
> Until I am sure that it is not only a problem on my side, I will refrain from further comments about posting encoding-related issues in html...

snap. encoded ampersands no spacing: unreadable.

p
Re: POST request encoding - Tomcat/JVM configuration?
André,

On 10/23/2009 10:21 AM, André Warnier wrote:
> Pfeifer Jan wrote:
>
> I don't know if this affects only my mail reader, but your messages to the list, for me, are almost impossible to read because of the apparent profusion of html escapes in them.

+1

Also, quotes are mixed-up with the main body of the message. I would guess a broken mailer.

It seems that Smart4Web 2.0 Mailer isn't so smart. :(

-chris
Re: POST request encoding - Tomcat/JVM configuration?
Pfeifer Jan wrote:
> Hi,
>
> I have been running a webapp for some time with no problems. After some "change" that I am not able to identify, my POST requests are messed up (GET works fine). This happens only on my local server.

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

Mark

> Some facts:
> Tomcat 5.0.28, jdk 1.4.2.07, winXP (others: XP, win server, linux)
>
> The following simple code produces a different result on my server than on the others (web.xml, server.xml and the jdk configuration are the same).
>
> Form page:
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> <title>Encoding test</title>
> <script type="text/javascript">
> function test(method){
>     var frm = document.getElementById("form");
>     frm.method = method;
>     frm.submit();
> }
> </script>
> </head>
> <body>
> <form method='POST' action='post_process.jsp' id='form'>
> <input type='text' name='param'>
> </form>
> <button onclick='test("get")'>get</button>
> <button onclick='test("post");'>post</button>
> </body>
> </html>
>
> post_process.jsp:
>
> <%@ page language="java" session="true" contentType="text/html; charset=UTF-8"%>
> <% request.setCharacterEncoding("UTF-8"); %>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> <title>Encoding test process</title>
> </head>
> <body>
> <%
>     String param = request.getParameter("param");
>     String decoded = new String(param.getBytes("iso-8859-1"),"UTF-8");
>     System.out.println("Original value --> "+param);
>     System.out.println("Decoded value --> "+decoded);
> %>
> V2<br>
> Method: <%= request.getMethod() %><br>
> Encoding: <%= request.getCharacterEncoding() %><br>
> Locale: <%= request.getLocale() %><br>
> Default System charset: <%= new java.io.OutputStreamWriter(new java.io.ByteArrayOutputStream()).getEncoding() %><br>
> Original value: '<span style='color:<%= request.getMethod().equals("GET") ? "red" : "green" %>'><%= param %></span>'<br>
> Decoded value: '<span style='color:<%= request.getMethod().equals("GET") ? "green" : "red" %>'><%= decoded %></span>'<br>
> </table>
> </body>
> </html>
>
> My output:
>
> Method: POST
> Encoding: UTF-8
> Locale: cs
> Default System charset: Cp1250
> Original value: '&Auml;&Atilde;&shy;&Aring;&frac34;ek'
> Decoded value: 'čížek'
>
> Correct output (any other server):
>
> V2
> Method: POST
> Encoding: UTF-8
> Locale: cs
> Default System charset: Cp1250
> Original value: 'čížek'
> Decoded value: '?�?ek'
>
> I spent the last two days googling and looking for an answer with no luck. I hope that someone can help.
>
> Thanks in advance
> Jan Pfeifer