I'm trying to set up a simple international servlet. The servlet should
read a parameter from an HTML form and convert it to a proper
java.lang.String, without any restrictions as to the natural language
and the character encoding of this parameter: suppose it is the
customer name in an order form that might come from any country /
platform; or a word submitted to a multi-language search engine or to
a translation service. Not a rare case.
Many suggest the following technique: create one form for each possible
encoding, statically or dynamically; pre-select the appropriate form in
some way; send the form with the encoding declared in the
"Content-Type" HTTP header and/or in a "Content-Type" META tag and/or in
the form's ACCEPT-CHARSET attribute, and also in a hidden charset form
field; then receive the parameter from the browser and have the servlet
decode it as follows:
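// "name" is the text field; "enc" is the hidden field naming the charset
String name = req.getParameter("name");
String enc = req.getParameter("enc");
// recover the raw bytes sent by the browser, then decode them properly
byte[] bytes = name.getBytes("8859_1");
name = new String(bytes, enc);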
[Note: getParameter returns a pseudo-String in which each received byte
is wrapped in one char, as if the bytes had been decoded as ISO-8859-1.]
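
The round trip described in the note can be tried outside a servlet
container. The following is only a sketch: the class and method names
are made up for illustration, the "container" step is simulated by
decoding the wire bytes as ISO-8859-1, and it uses java.nio.charset,
which is newer than the JSDK 2.0 environment discussed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ParameterRecoder {

    // Simulates the container: every received byte is wrapped in one char.
    static String asContainerWouldDecode(byte[] wire) {
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Undoes the byte-per-char decoding, then decodes with the real charset.
    static String recode(String raw, String enc) {
        if (raw == null) {
            return null; // parameter was not submitted
        }
        byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, Charset.forName(enc));
    }

    public static void main(String[] args) {
        // Sample input written with Unicode escapes (a Polish city name).
        String typed = "\u0141\u00F3d\u017A";
        // Bytes a browser would send for a form served as ISO-8859-2.
        byte[] wire = typed.getBytes(Charset.forName("ISO-8859-2"));
        String raw = asContainerWouldDecode(wire);
        String name = recode(raw, "ISO-8859-2");
        System.out.println(typed.equals(name)); // prints true
    }
}
```

The ISO-8859-1 step is lossless because that charset maps every byte
value 0..255 to exactly one char and back, which is why the original
bytes can always be recovered from the pseudo-String.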
I tried this technique successfully with several ISO-8859-X encodings,
and there are examples with Shift_JIS (see [1]).
Unfortunately this technique requires selecting the appropriate
encoding, which is no easy task: there is no one-to-one mapping from
language to charset; the "Accept-Charset" HTTP request header is not
always sent or reliable; and asking the user for the encoding is neither
practical nor elegant.
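
Where the header is present, a servlet can at least try it and fall back
to a default otherwise. The sketch below is a hypothetical helper (the
name CharsetPicker is mine); it ignores q-values and simply returns the
first charset in the header value that the JVM supports. The header
value itself would come from req.getHeader("Accept-Charset").

```java
import java.nio.charset.Charset;

final class CharsetPicker {

    // Returns the first charset listed in an Accept-Charset header value
    // that this JVM supports, or the fallback if none is usable.
    static String pick(String acceptCharset, String fallback) {
        if (acceptCharset == null) {
            return fallback; // header not sent
        }
        for (String item : acceptCharset.split(",")) {
            int semi = item.indexOf(';'); // drop any ";q=..." parameter
            String name = (semi >= 0 ? item.substring(0, semi) : item).trim();
            if (name.isEmpty() || name.equals("*")) {
                continue; // wildcard tells us nothing specific
            }
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // syntactically illegal charset name: skip it
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(pick("x-not-a-charset;q=0.9, utf-8;q=0.5", "ISO-8859-1"));
        System.out.println(pick(null, "ISO-8859-1"));
    }
}
```

This does not make the header any more reliable; it only keeps the
servlet from failing when the header is missing or names a charset the
JVM does not know.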
I thought the charset selection might be avoided altogether if one could
use "UTF-8" as the single, universal encoding. Browsers that support
UTF-8 -- like recent versions of NS and IE -- should encode in UTF-8
and send back whatever the user typed in a textbox, if the encoding is
properly requested. So I tried the same technique with UTF-8, under
NT4.0 Server, JSDK2.0, Netscape 4.7. Using Java, I created a UTF-8
encoded HTML file containing a form whose textbox parameter had a value
of 3 characters. Comparing the bytes received from the browser with the
expected UTF-8 bytes, I got this result:
1 byte mapping to 1 char c<=127 is sent properly
2 bytes mapping to 1 char 128<=c<=255 are sent properly
2 bytes mapping to 1 char c>255 are replaced by \u003F (QUESTION MARK)
I obtained similar results (i.e. bad encoding of characters beyond \xFF)
with Amaya.
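
For reference, the expected UTF-8 bytes for one character from each of
the three ranges can be printed as below. This is a sketch with a
made-up class name and sample characters of my choosing. The ISO-8859-1
column is included only because a lossy conversion to a single-byte
charset yields the same 0x3F symptom; whether that is what the browsers
do here is an assumption, not something the test above establishes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Probe {

    // Space-separated hex dump of s encoded in the given charset.
    static String hex(String s, Charset cs) {
        // getBytes substitutes the charset's replacement byte for
        // characters the charset cannot represent.
        byte[] bytes = s.getBytes(cs);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One sample per range: c <= 127, 128 <= c <= 255, c > 255.
        String[] samples = { "A", "\u00E9", "\u0141" };
        for (String s : samples) {
            System.out.println("U+" + String.format("%04X", (int) s.charAt(0))
                + "  UTF-8: " + hex(s, StandardCharsets.UTF_8)
                + "  ISO-8859-1: " + hex(s, StandardCharsets.ISO_8859_1));
        }
    }
}
```

This prints 41 / 41 for U+0041, C3 A9 / E9 for U+00E9, and C5 81 / 3F
for U+0141: the third character has a valid 2-byte UTF-8 form, but turns
into a question mark as soon as it passes through ISO-8859-1.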
I suppose there must be common, accepted ways to handle this kind
of problem. I would consider applet-based solutions as the last resort.
I would appreciate any comments or suggestions.
Thanks,
L.P.
-----------
References:
[1] Java Servlet Programming by Jason Hunter, William Crawford, Paula
Ferguson (Editor), December 1998, Chapter 12.
http://www.servlets.com/book
[2] Netscape I18N documents:
http://developer.netscape.com:80/software/jdk/i18n.html
http://people.netscape.com/ftang/paper/unicode11paper/
[3] HTML4.0 Specification:
http://www.w3.org/TR/1999/REC-html401-19991224
[4] HTML Unleashed:
http://www.webreference.com/dlab/books/html/39-0.html
[5] Several postings to this mailing list from Martin Kuba