Re: [Proposal] Default Encoding option for JSP/Tomcat in server.xml or web.xml

2001-05-12 Thread Alec Yu

From: "Craig R. McClanahan" <[EMAIL PROTECTED]>
> Servlet Specification 2.3 (Proposed Final Draft 2), Section 5.4 (p. 44):
> 
> 'The default encoding of a response is "ISO-8859-1"
> if none has been specified by the servlet programmer.'
I am a servlet programmer too,
so why can't I specify it in the container configuration files? *giggle*

> Providing container-level overrides for this would seem to break the spec,
> and any application that depended on that feature would not be portable
> to other containers.
Suppose we are developing a web product in JSP, targeting 3 markets
(say, Japan, Taiwan & Korea).

Meanwhile, our product cooperates with some other servlet/JSP-based
product(s) from 3rd party vendors.

The concern is:
If there is no way to set a default encoding in web.xml/server.xml or whatever
configuration file the servlet/JSP engine uses, then we have to modify not only
our own code & pages, but also those from other vendors.

Worse, what about servlets that come without source code?

Let's see a real example: (my personal web site)
Sun's Brazil web server acts as the front-end web server (because it is lightweight
and responds faster), with my own brazil-to-tomcat connector (it invokes servlets/JSP
pages via direct Java calls, not via socket connections).

Everything was fine until Jive (a free forum system in JSP & beans) joined this
system.
Wow. Jive can't handle Big5, Shift_JIS, GB2312 or anything else like utf-8; only
ISO 8859-1 works fine.

Hell, should I modify Jive again and again and again, when Jive updates so often?
How about some new custom Jive skins from somewhere around the world?
How about other 3rd party JSP pages?

The servlet/JSP specifications make me feel that
they aim only at L10N problems, not I18N problems.

> > This seems to work at first, as long as you don't treat strings read
> > from GET/POST parameters as Unicode strings, because they are NOT
> > VALID UNICODE STRINGS. Web output generated from servlets/JSP pages
> > may be right, simply because contents in these NOT VALID UNICODE
> > STRINGS are converted into bytes again by simply doing char->byte
> > typecasting.
> For GET requests, there are not very many good solutions because the
> request itself does not include information about the character encoding
> that was used on the request URI.
Yes, years ago I read something similar explaining why no standard exists for
determining the encoding of GET parameters.
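
To make the point about GET parameters concrete, here is a minimal sketch using only
the JDK's java.net.URLDecoder. The class name QueryDecodeDemo and the sample value are
made up for illustration; this is not Tomcat's parameter-parsing code. It only shows
that the same %xx sequence yields different strings depending on which charset the
receiver assumes, since the request itself does not say.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Tomcat or the servlet API.
class QueryDecodeDemo {

    // Decode a percent-encoded value under an assumed charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The three bytes E4 B8 AD are the UTF-8 form of U+4E2D.
        String raw = "%E4%B8%AD";

        String asUtf8 = decode(raw, "UTF-8");        // one character, U+4E2D
        String asLatin1 = decode(raw, "ISO-8859-1"); // three characters, one per byte

        System.out.println(asUtf8.length());   // prints 1
        System.out.println(asLatin1.length()); // prints 3
    }
}
```

Nothing in the query string tells the server which of the two results the browser
meant, which is why the receiver has to be told the encoding some other way.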

> Could you point me specifically to the byte->char/char->byte code that you
> are concerned about?
Hmm. Thank you for the detailed explanations.
Indeed, what I was talking about is not broken.
After following the spec more closely, it works now.

> You are obviously free to do this kind of special connector, and/or modify
> Tomcat to meet your needs -- but you're also making yourself dependent on
> conventions that are contrary to the servlet and JSP specifications.  Any
> apps you write that depend on this behavior won't run on any other servers
> that implement the standards.  You might want to look at standards based
> alternatives to at least some of the issues that you have raised.
I am just curious why the standard specifications cost people here so much
maintenance time, simply because they do not let us specify default encodings
for compile time, input time and run time once in a few configuration files,
but force us to specify them in every page and every servlet. Meanwhile, since
our products cooperate with code/pages that come from other people, we have to
ask their developers:
Could you send us a copy of your source code/pages?
Could you take care of character encodings other than the one you use yourself?
Could you ..

What kind of I18N solution looks like this?
Sure, UTF-8 greatly eases the problems on input & output, but it does not solve
the maintenance problem with other people's code/pages. And not everybody is willing
to take UTF-8 as their default encoding, because only a few tools are able to
edit UTF-8 documents (let's forget M$ FrontPage, which surely has poor support for JSPs;
Dreamweaver is great, but lacks UTF-8 support; Amaya has poor DBCS support,
not to mention JSP; even among plain text editors, few support UTF-8).

You know, lots of, if not most, JSP pages around the world come with no page
contentType directive, and many servlets do not even specify their own character
encoding or provide an option in some configuration file to do so. The real
nightmare is not in our own servlets/pages, but in other people's.

ps.
I am a newbie and do not know how to submit code to Apache projects.
I installed JAMES 1.2.1 on my personal web site and found that it garbled 8-bit MIME
mail headers.
I fixed that, and added an SMTP AUTH LOGIN function to its SMTP handler.
(such that, you may put a matcher to allow mail relay by checking a

[Proposal] Default Encoding option for JSP/Tomcat in server.xml or web.xml

2001-05-11 Thread Alec Yu

I read some code in Catalina & Jasper and found the following:
There is now a setCharacterEncoding() for the servlet request, but I grepped all the
Tomcat code and found nowhere that calls it. This means that, by default, Tomcat uses
an encoding of '8859_1'. There is no option in server.xml/web.xml to give a
context/container (or whatever) a default encoding other than '8859_1'.

Also, the alternative (JSP compiling) encoding option in conf/web.xml for Jasper
seems to fail (at least, it fails for JSP pages in Big5 encoding).
When there is no '<%@ page contentType="text/html; charset=xxx" %>' in a JSP,
Jasper uses '8859_1' as the JSP's default encoding, oops.

We are working on a product that deploys JSP pages and targets multiple
markets in Japan, Taiwan, and probably mainland China. Sure, when we maintain
our JSP pages (which initially show messages in English, but should be able to handle
input in localized character encodings), we don't want to maintain 3 versions of
the JSP pages, with each version differing only in the page directive:
'<%@ page contentType="text/html; charset=xxx" %>'


And I found that Tomcat does a byte->char typecast first and then a char->byte typecast
back before converting the bytes into a Java string. Unfortunately, the character
encoding used there is never changed from '8859_1', because there is nowhere other
than code to assign a custom one.

This seems to work at first, as long as you don't treat strings read from GET/POST
parameters as Unicode strings, because they are NOT VALID UNICODE STRINGS.
Web output generated from servlets/JSP pages may be right, simply because contents
in these NOT VALID UNICODE STRINGS are converted into bytes again by simply
doing char->byte typecasting.
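
The "garbage in, garbage out" effect described above can be reproduced with the JDK
alone. The sketch below is an illustration under assumptions, not Tomcat's actual
code: the class Latin1RoundTrip and its method names are invented here, and UTF-8 is
used as the sample encoding because every JRE ships it. It shows that a byte->char
cast produces a string that is not the intended text, that a char->byte cast restores
the original bytes, and that the text can only be recovered when the real encoding is
known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class Latin1RoundTrip {

    // byte->char typecast: every byte becomes one char in the range 0..255,
    // which is the same result as decoding the bytes as ISO-8859-1.
    static String castDecode(byte[] raw) {
        char[] chars = new char[raw.length];
        for (int i = 0; i < raw.length; i++) {
            chars[i] = (char) (raw[i] & 0xFF);
        }
        return new String(chars);
    }

    // char->byte typecast: keeps only the low 8 bits of each char.
    static byte[] castEncode(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    // Recover the intended text once the real encoding is known.
    static String repair(String mangled, Charset actual) {
        return new String(mangled.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587";                        // two CJK characters
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // six bytes on the wire

        String mangled = castDecode(wire);
        System.out.println(mangled.length());                         // prints 6
        System.out.println(mangled.equals(original));                 // prints false
        System.out.println(Arrays.equals(castEncode(mangled), wire)); // prints true
        System.out.println(repair(mangled, StandardCharsets.UTF_8).equals(original)); // prints true
    }
}
```

Echoing the mangled string straight back to the browser looks correct, but any code
that measures, searches or stores it as text is working on six Latin-1 characters
instead of two CJK ones.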

Oops! It goes too far. People can't just do internationalization/localization with such a
"garbage in garbage out" solution. Maybe it looks right at both the input and output ends,
if you simply GET/POST something and out.println(xxx.getParameter("foo")).
But if you are doing something serious with character encodings other than 8859_1
(if Big5, GB2312 and Shift_JIS are for localization and not serious enough, how about
the utf-8 character encoding? Indeed, Tomcat garbled GET/POST inputs in utf-8 encoding),
you must handle this problem.

Personally, I coded my own connector to address this problem. The connector works as a
bridge from Sun's Brazil web server (a lightweight web server in 100% Java). Brazil
HTTP request objects are passed directly into the connector (rather than via some
socket protocol), so that the connector can configure servlets/JSP pages to use a
default encoding given by properties set in the Brazil configuration file. It also does
a URL encoding check against the raw strings input in GET/POST parameters in localized
character encodings, to make sure Tomcat does the right character conversions for these
parameters. (The %xx URL decoding code in parseParameters() in Tomcat 4 beta 3/4 works
fine, but the byte->char/char->byte code drops some characters.) But there is no way to
modify Jasper's default compiling encoding, except by modifying its code.