Pfeifer Jan wrote:
...
I know about URIEncoding in server.xml and about using Encoding filter, but we use this for decoding GET requests for historical reasons. Or is there a more "correct" way to decode String?
Jan,
this whole area of the character set in which HTTP requests come into a
server, and are decoded by the server, is complicated, confusing, and
generally not well-defined (or defined in contradictory ways) by the
Internet RFCs themselves.
In short, there can be many reasons why you are not getting the data in
the character set that you expect, and finding the specific reason that
applies in your case can be tedious and involve several levels.
To resolve it, you have to be very systematic, and check every step one
by one.
Here are some principles :
1) the general "default" for the HTTP protocol, and for HTML, is
iso-8859-1. Anything else, you have to explicitly specify.
iso-8859-1 is at the same time a character set, and an encoding, in
which each character is represented by one byte.
2) internally, Java represents all character strings as Unicode (which
is a character set), using 16-bit code units, i.e. UTF-16 (which is an
encoding).
(1) and (2) above mean that somewhere, no matter what, some character
set translation is going to take place, between "the web" and your Java
webapp, and vice-versa between your webapp and the web. The trick is to
get the pieces in place so that the /correct/ translations take place in
each direction.
3) iso-8859-1 (in fact every iso-8859-x character set and encoding) can
only represent 256 different characters, which is not enough to
cover all languages used on the WWW nowadays. So if your applications
have to use Czech and German at the same time, you should not use an
iso-8859 charset.
4) UTF-8 is a popular encoding of Unicode, where each character is
represented by one or more bytes.
The big advantage of Unicode/UTF-8 is that it can represent all
characters of all languages used on the WWW.
The drawback of Unicode/UTF-8 at the moment is that, for historical
reasons, it is /not/ the HTTP/HTML default charset, so you have to
explicitly specify it in several places.
5) despite what is said above about the default for HTTP being
iso-8859-1, URLs are an exception. A URL, by definition, is not in any
specific character set or encoding. The definition of URLs just says
that, whatever the character set and encoding used, any byte whose
value does not match one of the printable characters of the US-ASCII
range (roughly [0-9A-Za-z] + some) must be encoded in "%AB" notation,
where "%AB" is the "%" sign followed by a 2-digit hexadecimal
representation of the byte value.
In other words it means that, when interpreting data that comes as part
of a URL (like the query string in an HTTP GET),
- the server first decodes the URI from the "%AB" encoding above, back
into a series of bytes
- then the server further decodes this series of bytes into a string of
characters, using some charset encoding
- but, the only way to know in which character set the data really is,
is *by convention* between the client and the server.
The convention, historically so far, has always been iso-8859-1.
Recently and slowly, it seems that this convention is now shifting
toward UTF-8.
But note that it is still a convention, and that in order to make sure
that your application (and Tomcat before it) can consider the parameters
from a GET URL to be UTF-8, /you/ have to make sure that every URL on
which a user may click in one of /your/ pages is indeed encoded that
way.
(And thus basically also, if you receive a request from an unknown
source, well, you have to guess..)
See in Tomcat 6.0 docs, the following attribute of the HTTP Connector :
URIEncoding :
This specifies the character encoding used to decode the URI bytes,
after %xx decoding the URL. If not specified, ISO-8859-1 will be used.
(The above applies to GET requests, because in that case the request
parameters are passed as part of the URI)
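To make the two decoding steps concrete, here is a minimal Java sketch
using only the standard java.net.URLEncoder and java.net.URLDecoder
classes. It is not Tomcat's own code, and the class and method names
(QueryDecodingDemo, encode, decode) are made up for the illustration.
It shows that the same "%AB" sequence gives a different string
depending on the charset the server assumes :

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryDecodingDemo {

    // Converts the value to bytes in the given charset, then writes every
    // non-ASCII byte in "%AB" notation.
    public static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Undoes the "%AB" notation to get bytes back, then interprets those
    // bytes in the given charset (the role URIEncoding plays in Tomcat).
    public static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE

        // The same character yields different bytes on the wire.
        System.out.println(encode(eAcute, "UTF-8"));      // %C3%A9
        System.out.println(encode(eAcute, "ISO-8859-1")); // %E9

        // Client and server agree on UTF-8: the character survives.
        System.out.println(decode("%C3%A9", "UTF-8").equals(eAcute)); // true

        // Client sent UTF-8, server assumes iso-8859-1: two wrong
        // characters (U+00C3 U+00A9) instead of one correct one.
        System.out.println(
            decode("%C3%A9", "ISO-8859-1").equals("\u00c3\u00a9")); // true
    }
}
```

The last case is the typical symptom of a URIEncoding that does not
match what the pages actually send.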
Now about POST requests :
In a POST, the request parameters are not sent as part of a query string
in a URI, but they are sent in the *body* of the request.
There are 2 ways to format a POST request from the client side :
a) as a "url-encoded" body (the default).
b) as a multipart/form-data body.
(That is the case if the <Form> tag contains the attribute :
enctype="multipart/form-data"
)
In (a), the body consists of one long string, which looks like the query
string of a GET :
param1=value1&param2=value2.....&paramn=valuen
The charset and encoding of that string are supposed to be given by the
"Content-type" HTTP header of that POST request.
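As a rough sketch of what "given by the Content-type header" means for
case (a), the following Java helper pulls the charset parameter out of
a header value and falls back to a default when the browser did not
send one. The class and method names (ContentTypeCharset, charsetOf)
are hypothetical, the parsing is deliberately simplistic, and a real
server does this for you :

```java
import java.util.Locale;

public class ContentTypeCharset {

    // Returns the value of the "charset" parameter of a Content-type
    // header, or the fallback if the header or the parameter is missing.
    public static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        // parts[0] is the media type itself; parameters follow after ";".
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2
                        && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Browser stated the charset: use it.
        System.out.println(charsetOf(
            "application/x-www-form-urlencoded; charset=UTF-8", "ISO-8859-1"));
        // Browser stated nothing: the server is left with the default.
        System.out.println(charsetOf(
            "application/x-www-form-urlencoded", "ISO-8859-1"));
    }
}
```

The second call is the common case, since browsers often leave the
charset out, which is where the guessing starts.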
In (b), it is more complicated :
The body of the request is composed of "parts", each part representing
one parameter. Each part /should/ have its own Content-type header,
indicating the type of that part, and if applicable, the character set
and encoding of that part.
In theory thus, there should never be any confusion about the character
set and encoding of POST data.
In practice however, there is a lot of it, because browsers and servers
alike do not always respect the above rules strictly.
For example, even modern browsers do not generally indicate a character
set and encoding for the text parts of (b) above.
See, in Tomcat docs, the following attribute of the HTTP Connector, as
an example of the confusion :
useBodyEncodingForURI:
This specifies if the encoding specified in contentType should be used
for URI query parameters, instead of using the URIEncoding. This setting
is present for compatibility with Tomcat 4.1.x, where the encoding
specified in the contentType, or explicitly set using
Request.setCharacterEncoding method was also used for the parameters
from the URL. The default value is false.
(editor's note: and rightly so)
Strictly speaking thus, the Request.setCharacterEncoding() method
/should not exist/, because the character set and encoding of request
data should always be specified by the browser, and the server should
not guess.
And the "useBodyEncodingForURI" attribute should not exist either,
because the URI may have a charset encoding, but it has nothing to do
with the encoding of the request body.
In practice, I have found that the following set of "recipes"
generally gives predictable results :
1) under Unix/Linux, in the scripts which start Tomcat, make sure that
the process which starts Tomcat is itself started under a UTF-8 locale.
For example, set
LC_ALL="en_US.utf8"; export LC_ALL
(if on your system, "en_US.utf8" is a valid locale. Use "locale -a" to
find out)
Under Windows, there is no such "locale" setting available, or I have
never found it. There, the JVM default charset normally follows the
system code page rather than UTF-8, so it is safer to start the JVM
with -Dfile.encoding=UTF-8.
2) to create your application HTML pages :
- use a UTF-8 aware editor, set for UTF-8 text mode, and save all your
pages as Unicode/UTF-8. (Do /not/ use Windows Notepad, because it saves
all UTF-8 documents with a leading BOM, which is wrong.)
- make sure that all your pages include the following in the HTML <Head>
part :
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
- make sure that all your <Form> tags include the following attribute :
<Form .... accept-charset="UTF-8">
3) In theory, you should make sure that whenever your server sends an
HTML page to a browser, it includes the proper HTTP "Content-type" in
the response, with the proper charset indication (UTF-8). I don't
exactly know how one specifies this explicitly in the case of Tomcat.
But it seems that it does it right all by itself.
4) do /not/ use the above "useBodyEncodingForURI" attribute for the
Tomcat Connectors.
5) If you do all that, /and/ are sure that all URL links in your html
pages are correctly encoded in UTF-8 + %AB encoding, then also use the
URIEncoding="UTF-8"
attribute of the Tomcat <Connector> tags.
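For recipe 1, one way to check which default charset the JVM really
picked up from its environment is a few lines of standard Java. The
class name DefaultCharsetCheck is just an illustration, and the output
depends on the platform, the Java version and the locale the JVM was
started under, so treat this as a diagnostic sketch only :

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    // Reports the charset the JVM uses when code does not name one.
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run it under the same user and environment as the script that starts
Tomcat, otherwise the result may not reflect what Tomcat itself sees.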
[OT, but not entirely] :
We definitely need a new HTTP 2.0 RFC, where :
- the URI charset/encoding is Unicode/UTF-8 by default, instead of
iso-8859-1
- HTML pages served by servers are UTF-8 by default, instead of iso-8859-1
- browsers using multipart/form-data POST encoding MUST provide a
"Content-type" (and, if applicable a "charset" attribute) for each part
of the POST
- servers MUST follow the request indications for Content-type
- browsers MUST follow server response indications for Content-type
(and not like IE, make their own guesses)