Pfeifer Jan wrote:
...

I know about URIEncoding in server.xml and about using an Encoding filter, but we use this for decoding GET requests for historical reasons. Or is there a more "correct" way to decode a String?
Jan,

this whole area of the character set in which HTTP requests come into a server, and are decoded by the server, is complicated, confusing, and generally not well-defined (or defined in contradictory ways) by the Internet RFCs themselves. In short, there can be many reasons why you are not getting the data in the character set that you expect, and finding the specific reason that applies in your case can be tedious and involve several levels. To resolve it, you have to be very systematic, and check every step one by one.
Here are some principles :

1) the general "default" for the HTTP protocol, and for HTML, is iso-8859-1. Anything else, you have to explicitly specify. iso-8859-1 is at the same time a character set, and an encoding, in which each character is represented by one byte.

2) internally, Java represents all character strings as Unicode (which is a character set), using 16-bit units in the UTF-16 encoding: one unit for most characters, two for the rarer ones outside the basic range.

(1) and (2) above mean that somewhere, no matter what, some character set translation is going to take place, between "the web" and your Java webapp, and vice-versa between your webapp and the web. The trick is to get the pieces in place so that the /correct/ translations take place in each direction.
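
To make that translation concrete, here is a minimal sketch using only the JDK. The class name CharsetRoundTrip and the sample character are mine, chosen purely for illustration. It shows that the same Java string becomes different bytes depending on the charset, and that decoding works as long as both sides use the same one:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {

    // Java string (Unicode internally) -> bytes as they would travel over the wire.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Bytes received from the wire -> Java string.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "\u00fc"; // LATIN SMALL LETTER U WITH DIAERESIS

        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1); // one byte: 0xFC
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);        // two bytes: 0xC3 0xBC

        System.out.println("iso-8859-1 length: " + latin1.length);
        System.out.println("UTF-8 length:      " + utf8.length);
        System.out.println("round trip equal:  "
                + text.equals(decode(utf8, StandardCharsets.UTF_8)));
    }
}
```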

3) iso-8859-1 (in fact each of the iso-8859-x character sets and encodings) can only represent 256 different characters, which is not enough to cover all languages used on the WWW nowadays. So if your applications have to use Czech and German at the same time, you should not use an iso-8859 charset.

4) UTF-8 is a popular encoding of Unicode, where each character is represented by one or more bytes. The big advantage of Unicode/UTF-8 is that it can represent all characters of all languages used on the WWW. The drawback of Unicode/UTF-8 at the moment is that, for historical reasons, it is /not/ the HTTP/HTML default charset, so you have to explicitly specify it in several places.

5) despite what is said above about the default for HTTP being iso-8859-1, URLs are an exception. A URL, by definition, is not in any specific character set or encoding. The definition of URLs just says that, whatever the character set and encoding used, any byte whose value does not match one of the printable characters of the US-ASCII range (roughly [0-9A-Za-z] + some) must be encoded in "%AB" notation, where "%AB" is : the "%" sign, followed by a 2-digit hexadecimal representation of the byte value.

In other words it means that, when interpreting data that comes as part of a URL (like the query string in an HTTP GET),
- the server first decodes the URI from the "%AB" encoding above, back into a series of bytes
- then the server further decodes this series of bytes into a string of characters, using some charset encoding
- but, the only way to know in which character set the data really is, is *by convention* between the client and the server.
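
The two decoding steps can be sketched as follows. This is only an illustration with the plain JDK, not how Tomcat itself is implemented; the class and method names are hypothetical, and malformed input (bad hex digits, non-ASCII characters in the raw URL) is not handled:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UriDecodeSketch {

    // Step 1: undo the "%AB" notation. The result is raw bytes; no charset is involved yet.
    static byte[] percentDecode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // Two hex digits give the value of one byte.
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.write(c); // assumed to be a plain US-ASCII character, one byte
                i++;
            }
        }
        return out.toByteArray();
    }

    // Step 2: interpret the bytes in the charset agreed on *by convention*.
    static String decode(String encoded, Charset convention) {
        return new String(percentDecode(encoded), convention);
    }

    public static void main(String[] args) {
        String fromUrl = "%C4%8D"; // the UTF-8 bytes of U+010D (c with caron)
        String asUtf8 = decode(fromUrl, StandardCharsets.UTF_8);
        String asLatin1 = decode(fromUrl, StandardCharsets.ISO_8859_1);
        System.out.println("as UTF-8:      " + asUtf8.length() + " character");
        System.out.println("as iso-8859-1: " + asLatin1.length() + " characters");
    }
}
```

The same two bytes come out as one Czech letter under the UTF-8 convention, and as two unrelated characters under iso-8859-1. Nothing in the URL itself says which reading is the right one.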

The convention, historically so far, has always been iso-8859-1.
Recently and slowly, it seems that this convention is now shifting toward UTF-8. But note that it is a convention still, and that in order to make sure that your application (and Tomcat before it) can consider the parameters from a GET URL to be UTF-8, /you/ have to make sure that all URLs on which a user may click in one of /your/ pages are indeed encoded that way. (And thus basically also, if you receive a request from an unknown source, well, you have to guess..)
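
If your pages are generated by Java code, one way to keep your own links consistent is to build every query parameter through java.net.URLEncoder with an explicit "UTF-8". A small sketch, in which the class name, the helper name and the sample value are my own:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class LinkBuilderSketch {

    // Encode one name=value pair: first to UTF-8 bytes, then to "%AB" notation.
    static String queryParam(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "Dvo\u0159\u00e1k" is a Czech name with two non-ASCII letters.
        System.out.println("/search?" + queryParam("name", "Dvo\u0159\u00e1k"));
    }
}
```

Note that URLEncoder produces the form-style encoding (a space becomes "+"), which is what you want inside a query string but not inside the path part of a URL.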

See in Tomcat 6.0 docs, the following attribute of the HTTP Connector :

URIEncoding :   
This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.

(The above applies to GET requests, because in that case the request parameters are passed as part of the URI)
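
To see what that default does to a UTF-8 URL, here is a sketch of the mismatch, and of the after-the-fact repair that some applications apply to the resulting string. It is my own illustration, not something taken from the Tomcat docs, and the names are hypothetical. The repair only works because iso-8859-1 maps every byte value to exactly one character, so no information is lost on the way:

```java
import java.nio.charset.StandardCharsets;

class MisdecodedParameter {

    // What you get when bytes that were really UTF-8 are decoded as iso-8859-1.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Workaround: go back to the original bytes, then decode them as UTF-8.
    static String repair(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] sent = "\u010d".getBytes(StandardCharsets.UTF_8); // client sent U+010D as UTF-8
        String wrong = decodeAsLatin1(sent);
        System.out.println("characters seen by the webapp: " + wrong.length());
        System.out.println("repaired correctly: " + "\u010d".equals(repair(wrong)));
    }
}
```

This is a workaround, not a fix: it silently corrupts any parameter that really was sent in iso-8859-1. Getting the decoding right at the Connector level is cleaner.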

Now about POST requests :

In a POST, the request parameters are not sent as part of a query string in a URI, but they are sent in the *body* of the request.
There are 2 ways to format a POST request from the client side :
a) as a "url-encoded" body (the default).
b) as a multipart/form-data body.
(That is the case if the <Form> tag contains the attribute :
enctype="multipart/form-data"
)

In (a), the body consists of one long string, which looks like the query string of a GET :
param1=value1&param2=value2.....&paramn=valuen
The charset and encoding of that string are supposed to be given by the "Content-type" HTTP header of that POST request.
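
As a sketch of what "given by the Content-type header" means for case (a): take the charset from the header if there is one, fall back to iso-8859-1 otherwise, and decode the body accordingly. The class and method names are hypothetical, and a real server does considerably more checking than this:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodySketch {

    // Extract "charset=..." from a Content-type value; default to iso-8859-1 if absent.
    // Charset.forName throws if the name is unknown or malformed.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    return Charset.forName(p.substring(8).trim().replace("\"", ""));
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Split "a=1&b=2" and decode each name and value with the header's charset.
    // A repeated parameter name simply overwrites the earlier value in this sketch.
    static Map<String, String> parse(String body, String contentType) {
        Charset cs = charsetOf(contentType);
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name, cs), decode(value, cs));
        }
        return params;
    }

    private static String decode(String s, Charset cs) {
        try {
            return URLDecoder.decode(s, cs.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // cannot happen for an existing Charset
        }
    }
}
```

For example, parse("city=Plze%C5%88", "application/x-www-form-urlencoded; charset=UTF-8") gives the city as five characters, while the same body with no charset in the header gives six, the last two being garbage.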

In (b), it is more complicated :
The body of the request is composed of "parts", each part representing one parameter. Each part /should/ have its own Content-type header, indicating the type of that part, and if applicable, the character set and encoding of that part.

In theory thus, there should never be any confusion about the character set and encoding of POST data. In practice however, there is a lot, because browsers and servers alike do not always respect the above rules strictly. For example, even modern browsers do not generally indicate a character set and encoding for the text parts of (b) above.

See, in Tomcat docs, the following attribute of the HTTP Connector, as an example of the confusion :

useBodyEncodingForURI:  
This specifies if the encoding specified in contentType should be used for URI query parameters, instead of using the URIEncoding. This setting is present for compatibility with Tomcat 4.1.x, where the encoding specified in the contentType, or explicitly set using Request.setCharacterEncoding method was also used for the parameters from the URL. The default value is false.
(editor's note: and rightly so)

Strictly speaking thus, the Request.setCharacterEncoding() method /should not exist/, because the character set and encoding of request data should always be specified by the browser, and the server should not guess. And the "useBodyEncodingForURI" attribute should not exist either, because the URI may have a charset encoding, but it has nothing to do with the encoding of the request body.


In practice, I have found that the following set of "recipes" generally gives predictable results :

1) under Unix/Linux, in the scripts which start Tomcat, make sure that
the process which starts Tomcat is itself started under a UTF-8 locale.
For example, set
LC_ALL="en_US.utf8"; export LC_ALL
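(if on your system, "en_US.utf8" is a valid locale. Use "locale -a" to find out.)
Under Windows, there is no such "locale" setting available, or I have never found it. As far as I know, the Windows JVM normally takes its default charset from the system code page instead, so if you need UTF-8 there, pass -Dfile.encoding=UTF-8 to the JVM explicitly.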

2) to create your application HTML pages :
- use a UTF-8 aware editor, set for UTF-8 text mode, and save all your pages as Unicode/UTF-8. (Do /not/ use Windows Notepad, because it saves all UTF-8 documents with a leading BOM, which is wrong.)
- make sure that all your pages include the following in the HTML <Head> part :
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
- make sure that all your <Form> tags include the following attribute :
<Form .... accept-charset="UTF-8">

3) In theory, you should make sure that whenever your server sends an HTML page to a browser, it includes the proper HTTP "Content-type" header in the response, with the proper charset indication (UTF-8). I don't exactly know how one specifies this explicitly in the case of Tomcat. But it seems that it does it right all by itself.

4) do /not/ use the above "useBodyEncodingForURI" attribute for the Tomcat Connectors.

5) If you do all that, /and/ are sure that all URL links in your html pages are correctly encoded in UTF-8 + %AB encoding, then also use the
URIEncoding="UTF-8"
attribute of the Tomcat <Connector> tags.
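
To check whether recipe (1) actually took effect, you can ask the JVM which default charset it started with. A minimal sketch (the class name is mine); run it with the same environment and JVM options as the ones used to start Tomcat:

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {

    // Report the charset the JVM uses when code does not name one explicitly.
    static String report() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Keep in mind that this default only matters for code that does not specify a charset itself; the decoding of URIs by the Connector is governed by URIEncoding, as described above.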


[OT, but not entirely] :

We definitely need a new HTTP 2.0 RFC, where :
- the URI charset/encoding is Unicode/UTF-8 by default, instead of iso-8859-1
- HTML pages served by servers are UTF-8 by default, instead of iso-8859-1
- browsers using multipart/form-data POST encoding MUST provide a "Content-type" (and, if applicable a "charset" attribute) for each part of the POST
- servers MUST follow the request indications for Content-type
- browsers MUST follow server response indications for Content-type
(and not like IE, make their own guesses)

