I read some of the code in Catalina and Jasper, and found the following:
There is now a setCharacterEncoding() method for servlet requests, but I grepped all of the
Tomcat code and found nothing that calls it. This means that Tomcat always uses a default
encoding of '8859_1'. There is no option in server.xml/web.xml to give a context/container
(or whatever) a default encoding other than '8859_1'.

Also, the alternative (JSP compiling) encoding option in conf/web.xml for Jasper
does not seem to work (at least, it failed for JSP pages in Big5 encoding).
When there is no '<%@ page contentType="text/html; charset=xxx" %>' in a JSP,
Jasper uses '8859_1' as the JSP's default encoding, oops.

We are working on a product that deploys JSP pages targeting multiple
markets: Japan, Taiwan, and probably mainland China. When we maintain
our JSP pages (which initially show messages in English, but should be able to handle
input in localized character encodings), we don't want to maintain 3 versions of
each JSP page that differ only in the page directive:
'<%@ page contentType="text/html; charset=xxx" %>'


I also found that Tomcat does a byte->char typecast first and then a char->byte typecast
back before converting bytes into a Java string. Unfortunately, the character encoding
used for that conversion is never changed from '8859_1', because there is nowhere
outside the code to assign a customized one.

This seems to work at first, as long as you don't treat strings read from GET/POST
parameters as Unicode strings, because they are NOT VALID UNICODE STRINGS.
Web output generated from servlets/JSP pages may be right, simply because the contents
of these NOT VALID UNICODE STRINGS are converted into bytes again by a simple
char->byte typecast.
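
To make the round trip concrete, here is a minimal sketch in plain Java. It is not Tomcat's code: the class and method names are made up for illustration, and UTF-8 stands in for the real request encoding (Big5 or Shift_JIS would behave the same way where the JRE supports them).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Hypothetical helper: decode raw request bytes as 8859_1,
    // so every byte becomes exactly one char (the "typecast").
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: undo the typecast, then decode the bytes
    // with the encoding the client actually used.
    static String repair(String broken, Charset actual) {
        return new String(broken.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String original = "\u65e5\u672c\u8a9e"; // three Japanese characters
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        String broken = decodeAsLatin1(wire);
        System.out.println(broken.length());         // 9, one char per byte
        System.out.println(broken.equals(original)); // false, not the intended text

        // Writing the broken string back out as 8859_1 reproduces the input bytes,
        // which is why the web output can still look right.
        byte[] out = broken.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(wire, out)); // true

        System.out.println(repair(broken, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

The repair step only works if the application knows the real encoding, which is exactly the information that cannot be configured at the moment.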

Oops! This goes too far. People can't do internationalization/localization with such a
"garbage in, garbage out" solution. It may look right at both the input and output ends
if you simply GET/POST something and call out.println(xxx.getParameter("foo")).
But if you are doing something serious with character encodings other than 8859_1
(if Big5, GB2312 and Shift_JIS are only for localization and not serious enough, how about
the UTF-8 character encoding? Indeed, Tomcat garbles GET/POST input in UTF-8 encoding),
you must handle this problem.

Personally, I wrote my own connector to address this problem. The connector works as a
bridge from Sun's Brazil web server (a lightweight web server in 100% Java). Brazil
HTTP request objects are passed directly into the connector (rather than via some
socket protocol), so the connector can configure servlets/JSP pages to use a default
encoding given by properties set in the Brazil configuration file. It also checks the
URL encoding of raw strings input in GET/POST parameters in localized character
encodings, to make sure Tomcat does the right character conversions for these
parameters. (The %xx URL decoding code in parseParameters() in Tomcat 4 beta 3/4 works
fine, but the byte->char/char->byte code drops some characters.) But there is no way to
modify Jasper's default compiling encoding, except by modifying its code.
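
As a rough illustration of decoding parameters with a configured encoding, here is a sketch using only the JDK's java.net.URLDecoder. It is not the connector's code or Tomcat's parseParameters(); the class name, the decodeParam helper, and the idea of passing the encoding as a plain string are assumptions made for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ParamDecoding {
    // Hypothetical helper: turn the %xx escapes of a raw parameter value into
    // bytes and interpret those bytes in the given (configured) encoding.
    static String decodeParam(String rawValue, String encoding) {
        try {
            return URLDecoder.decode(rawValue, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 bytes of three Japanese characters, as a browser would send them.
        String raw = "%E6%97%A5%E6%9C%AC%E8%AA%9E";

        // With the right encoding the parameter is a valid Unicode string.
        String good = decodeParam(raw, "UTF-8");
        System.out.println(good.equals("\u65e5\u672c\u8a9e")); // true
        System.out.println(good.length());                     // 3

        // With the fixed 8859_1 default, each byte becomes its own char.
        String bad = decodeParam(raw, "ISO-8859-1");
        System.out.println(bad.length());                      // 9
    }
}
```

In a real setup the encoding name would come from configuration (in my case, a property in the Brazil configuration file) rather than being hard-coded.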

