Re: POST request encoding - Tomcat/JVM configuration?

André Warnier Sat, 24 Oct 2009 05:30:43 -0700

Pfeifer Jan wrote:
...

I know about URIEncoding in server.xml and about using an Encoding filter, but we use this for decoding GET requests for historical reasons. Or is there a more "correct" way to decode a String?

Jan,

this whole area of the character set in which HTTP requests come into a server, and are decoded by the server, is complicated, confusing, and generally not well-defined (or defined in contradictory ways) by the Internet RFCs themselves.

In short, there can be many reasons why you are not getting the data in the character set that you expect, and finding the specific reason that applies in your case can be tedious and involve several levels.

To resolve it, you have to be very systematic, and check every step one by one.

Here are some principles :

1) the general "default" for the HTTP protocol, and for HTML, is iso-8859-1. Anything else, you have to explicitly specify.
iso-8859-1 is at the same time a character set, and an encoding, in which each character is represented by one byte.

2) internally, Java represents all character strings as Unicode (which is a character set), using a 16-bit representation for each character (UTF-16, which is an encoding; characters outside the basic plane take two 16-bit units).

(1) and (2) above mean that somewhere, no matter what, some character set translation is going to take place, between "the web" and your Java webapp, and vice-versa between your webapp and the web. The trick is to get the pieces in place so that the /correct/ translations take place in each direction.
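To make the translation in (1) and (2) concrete, here is a minimal sketch using only the JDK. The class and method names are made up for illustration. It encodes the same Java string with two charsets, then decodes UTF-8 bytes with the wrong charset to show the typical garbage that results.

```java
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // "\u00e9" is e-acute; written as an escape so the source file encoding does not matter.
    static final String E_ACUTE = "\u00e9";

    // One byte (0xE9) in iso-8859-1.
    static int latin1Length() {
        return E_ACUTE.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Two bytes (0xC3 0xA9) in UTF-8.
    static int utf8Length() {
        return E_ACUTE.getBytes(StandardCharsets.UTF_8).length;
    }

    // What the webapp sees if UTF-8 bytes are decoded as iso-8859-1:
    // each byte becomes its own character.
    static String misdecoded() {
        byte[] utf8 = E_ACUTE.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("iso-8859-1 bytes: " + latin1Length());
        System.out.println("UTF-8 bytes:      " + utf8Length());
        System.out.println("misdecoded chars: " + misdecoded().length());
    }
}
```

The Java String itself is the same in both cases; only the byte sequence on the wire differs, which is why both sides have to agree on the charset.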

3) iso-8859-1 (in fact all iso-8859-x character sets and encodings) can each represent only 256 different characters, which is not enough to cover all languages used on the WWW nowadays. So if your applications have to use Czech and German at the same time, you should not use an iso-8859 charset.
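A quick way to see the limit for iso-8859-1 specifically is to ask the JDK whether a given text can be encoded at all. This is only a sketch; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Coverage {
    // True if every character of the text has a representation in the charset.
    static boolean fits(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // String.getBytes silently substitutes '?' for characters the charset lacks.
    static byte firstByteInLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1)[0];
    }

    public static void main(String[] args) {
        String germanUUmlaut = "\u00fc"; // u with diaeresis
        String czechCCaron = "\u010d";   // c with caron
        System.out.println(fits(germanUUmlaut, StandardCharsets.ISO_8859_1));
        System.out.println(fits(czechCCaron, StandardCharsets.ISO_8859_1));
        System.out.println(fits(germanUUmlaut + czechCCaron, StandardCharsets.UTF_8));
    }
}
```

Note the silent '?' substitution: data loss of this kind does not raise an exception, so it tends to be noticed late.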

4) UTF-8 is a popular encoding of Unicode, where each character is represented by one or more bytes.
The big advantage of Unicode/UTF-8 is that it can represent all characters of all languages used on the WWW.
The drawback of Unicode/UTF-8 at the moment is that, for historical reasons, it is /not/ the HTTP/HTML default charset, so you have to explicitly specify it in several places.

5) despite what is said above about the default for HTTP being iso-8859-1, URLs are an exception. A URL, by definition, is not in any specific character set or encoding. The definition of URLs just says that, whatever the character set and encoding used, *any* byte whose value does not match one of the printable characters of the US-ASCII range (roughly [0-9A-Za-z] + some), must be encoded in "%AB" notation, where "%AB" is : the "%" sign, followed by a 2-digit hexadecimal representation of the byte value.

In other words it means that, when interpreting data that comes as part of a URL (like the query string in a HTTP GET),
- the server first decodes the URI from the "%AB" encoding above, back into a series of bytes
- then the server further decodes this series of bytes into a string of characters, using some charset encoding
- but, the only way to know in which character set the data really is, is *by convention* between the client and the server.
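The two decoding steps above can be sketched in plain Java as follows. This is an illustration only, not how Tomcat implements it: the class and method names are hypothetical, and the input is assumed to be well-formed ASCII with valid "%AB" sequences.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UriComponentDecoder {
    // Step 1: undo the "%AB" notation, producing raw bytes (no charset involved yet).
    static byte[] percentDecode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // Two hex digits give one byte value; invalid digits throw NumberFormatException.
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write((byte) c); // assumed to be a plain US-ASCII character
            }
        }
        return out.toByteArray();
    }

    // Step 2: turn the bytes into characters, using whatever charset the
    // client and server have agreed on by convention.
    static String decode(String encoded, Charset convention) {
        return new String(percentDecode(encoded), convention);
    }

    public static void main(String[] args) {
        // The same URL bytes give different strings depending on the convention.
        System.out.println(decode("caf%C3%A9", StandardCharsets.UTF_8));
        System.out.println(decode("caf%C3%A9", StandardCharsets.ISO_8859_1));
    }
}
```

Step 1 is unambiguous; all the trouble is in the charset passed to step 2, which nothing in the URL itself identifies.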


The convention, historically so far, has always been iso-8859-1.

Recently and slowly, it seems that this convention is now shifting toward UTF-8.
But note that it is a convention still, and that in order to make sure that your application (and Tomcat before it) can consider the parameters from a GET URL to be UTF-8, /you/ have to make sure that all URLs on which a user may click in one of /your/ pages are indeed encoded that way.
(And thus basically also, if you receive a request from an unknown source, well, you have to guess..)


See in Tomcat 6.0 docs, the following attribute of the HTTP Connector :

URIEncoding :

This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.

(The above applies to GET requests, because in that case the request parameters are passed as part of the URI)
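When a parameter that was really sent as UTF-8 has already been decoded with the ISO-8859-1 default, a common after-the-fact workaround is to turn the string back into bytes and decode those bytes again as UTF-8. The sketch below shows that workaround with a hypothetical helper name. It only gives correct results under the assumption that the original bytes were UTF-8 and that the first decoding was ISO-8859-1; setting the proper charset at the Connector level avoids the need for it.

```java
import java.nio.charset.StandardCharsets;

class LatinToUtf8Repair {
    // iso-8859-1 maps every byte value to exactly one character and back,
    // so getBytes(ISO_8859_1) recovers the original bytes without loss.
    static String reinterpretAsUtf8(String decodedAsLatin1) {
        byte[] originalBytes = decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00c3\u00a9" is what e-acute looks like after the wrong decoding.
        String repaired = reinterpretAsUtf8("\u00c3\u00a9");
        System.out.println(repaired.equals("\u00e9"));
    }
}
```

Applying this helper to a string that was decoded correctly in the first place will damage it, so it should not be applied blindly to all parameters.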


Now about POST requests :

In a POST, the request parameters are not sent as part of a query string in a URI, but they are sent in the *body* of the request.

There are 2 ways to format a POST request from the client side :
a) as a "url-encoded" body (the default).
b) as a multipart/form-data body.
(That is the case if the <Form> tag contains the attribute :
enctype="multipart/form-data"
)

In (a), the body consists of one long string, which looks like the query string of a GET :

param1=value1&param2=value2.....&paramn=valuen

The charset and encoding of that string are supposed to be given by the "Content-type" HTTP header of that POST request.
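As an illustration of case (a), here is a minimal sketch of what a server has to do with such a body: take the charset from the "Content-type" header if there is one, fall back to a default otherwise, and then decode each name and value. The class, the method names and the sample values are assumptions made up for this example; a real servlet container does this for you.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodyDecoder {
    // Extracts the charset parameter from a Content-type value, e.g.
    // "application/x-www-form-urlencoded; charset=UTF-8".
    // Charset.forName throws if the name is unknown to the JVM.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                return Charset.forName(p.substring(8).trim().replace("\"", ""));
            }
        }
        return fallback;
    }

    // Splits "param1=value1&param2=value2" and decodes both sides.
    static Map<String, String> parse(String body, Charset charset) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(urlDecode(name, charset), urlDecode(value, charset));
        }
        return params;
    }

    // URLDecoder also turns '+' into a space, as the url-encoded form format requires.
    private static String urlDecode(String s, Charset charset) {
        try {
            return URLDecoder.decode(s, charset.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // cannot happen for an existing Charset
        }
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("application/x-www-form-urlencoded; charset=UTF-8",
                StandardCharsets.ISO_8859_1);
        System.out.println(parse("city=Plze%C5%88&q=a+b", cs));
    }
}
```

The fallback argument is where the guessing happens when the client sends no charset.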


In (b), it is more complicated :

The body of the request is composed of "parts", each part representing one parameter. Each part /should/ have its own Content-type header, indicating the type of that part, and if applicable, the character set and encoding of that part.

In theory thus, there should never be any confusion about the character set and encoding of POST data.
In practice however, there is a lot, because browsers and servers alike do not always respect the above rules strictly.
For example, even modern browsers do not generally indicate a character set and encoding for the text parts of (b) above.

See, in Tomcat docs, the following attribute of the HTTP Connector, as an example of the confusion :


useBodyEncodingForURI:

This specifies if the encoding specified in contentType should be used for URI query parameters, instead of using the URIEncoding. This setting is present for compatibility with Tomcat 4.1.x, where the encoding specified in the contentType, or explicitly set using Request.setCharacterEncoding method was also used for the parameters from the URL. The default value is false.

(editor's note: and rightly so)

Strictly speaking thus, the Request.setCharacterEncoding() method /should not exist/, because the character set and encoding of request data should always be specified by the browser, and the server should not guess.
And the "useBodyEncodingForURI" attribute should not exist either, because the URI may have a charset encoding, but it has nothing to do with the encoding of the request body.

In practice, I have found that the following set of "recipes" generally gives predictable results :


1) under Unix/Linux, in the scripts which start Tomcat, make sure that
the process which starts Tomcat is itself started under a UTF-8 locale.
For example, set
LC_ALL="en_US.utf8"; export LC_ALL

(if on your system, "en_US.utf8" is a valid locale. Use "locale -a" to find out)

Under Windows, there is no such "locale" setting available, or I have never found it. But the Windows JVM seems to always start in a UTF-8 mode anyway.
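To check what the JVM actually picked up from the environment, a small diagnostic like the following can be run with the same startup settings as Tomcat. It is only a sketch with a made-up class name. On older Java versions the default charset is derived from the locale at startup; since Java 18 the default is UTF-8 regardless of locale.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {
    // Reports the charset used whenever code does not name one explicitly.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    // Naming the charset explicitly gives the same bytes under any locale.
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(describe());
        System.out.println(utf8Bytes("\u00e9").length);
    }
}
```

Code that always passes an explicit charset, as utf8Bytes does, is not affected by the locale at all; the locale only matters for calls that rely on the default.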


2) to create your application HTML pages :

- use a UTF-8 aware editor, set for UTF-8 text mode, and save all your pages as Unicode/UTF-8. (Do /not/ use Windows Notepad, because it saves all UTF-8 documents with a leading BOM, which is wrong.)
- make sure that all your pages include the following in the HTML <Head> part :

<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
- make sure that all your <Form> tags include the following attribute :
<Form .... accept-charset="UTF-8">

3) In theory, you should make sure that whenever your server sends an html page to a browser, it includes the proper HTTP "Content-type" in the response, with the proper charset indication (UTF-8). I don't exactly know how one specifies this explicitly in the case of Tomcat. But it seems that it does it right all by itself.

4) do /not/ use the above "useBodyEncodingForURI" attribute for the Tomcat Connectors.

5) If you do all that, /and/ are sure that all URL links in your html pages are correctly encoded in UTF-8 + %AB encoding, then also use the

URIEncoding="UTF-8"
attribute of the Tomcat <Connector> tags.


[OT, but not entirely] :

We definitely need a new HTTP 2.0 RFC, where :

- the URI charset/encoding is Unicode/UTF-8 by default, instead of iso-8859-1

- HTML pages served by servers are UTF-8 by default, instead of iso-8859-1

- browsers using multipart/form-data POST encoding MUST provide a "Content-type" (and, if applicable a "charset" attribute) for each part of the POST

- servers MUST follow the request indications for Content-type
- browsers MUST follow server response indications for Content-type
(and not like IE, make their own guesses)

