Re: UTF-8 and getParameter() WAS: unicode special characters

Marco Trevisan Wed, 30 May 2001 13:59:10 -0700
Hi Tomas,

Just to be clearer: we actually support IE 5.0/5.5, all experiments were
conducted on this client, and the tests were run with Tomcat alone and in
conjunction with Apache (Linux/Win32).
Having said this, we found that after setting the charset with
setContentType the browser started responding with UTF-8, which we later
also found described on MSDN.
We don't specify any charset for the form; it's the browser that starts to
respond with UTF-8 post data.
Analyzing the string coming out of the request, I saw that the character
sequences were in "quasi" UTF-8, and digging into the Tomcat source code I
discovered the following line:
Hashtable postParameters = HttpUtils.parsePostData(contentLength, is);

After decompiling servlet.jar (the one bundled with Tomcat; I think it's at
least 2.1 API compliant) I found the following line:
String postedBody = new String(postedBytes, 0, len, "8859_1");
so I concluded that this line was the source of the noise.
In particular, for the euro symbol the sequence of characters is:
String euroAs8859 = new String(new char[]{(char)226, (char)130, (char)172});
while the desired sequence, from my point of view, is the same three bytes
decoded as UTF-8:
String euroAsUTF8 = new String(new byte[]{(byte)-30, (byte)-126, (byte)-84}, "UTF-8");

So I resolved to translate the sequence with a character substitution before
applying UTF-8 decoding (at least 2 versions are present in the JDK source).
I was lucky because the routine that fetches data is unique and used by the
entire application.
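
A minimal sketch of this kind of repair, assuming the parameter really was
decoded as ISO-8859-1 as in the servlet.jar line above. It does not reproduce
the character-substitution routine mentioned here (which is not shown);
instead it swaps in the common round trip through ISO-8859-1 bytes. The class
and method names are made up for illustration, and StandardCharsets needs
Java 7 or later (on an older JDK the charset names would be passed as strings).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tomcat or the servlet API.
class Latin1ToUtf8 {

    // Recover the original bytes of a string that was decoded as ISO-8859-1,
    // then decode those same bytes as UTF-8.
    static String recode(String misdecoded) {
        if (misdecoded == null) {
            return null;
        }
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three characters observed for the euro symbol.
        String euroAs8859 = new String(new char[]{(char) 226, (char) 130, (char) 172});
        String euro = recode(euroAs8859);
        System.out.println(euro.length());                        // 1
        System.out.println(Integer.toHexString(euro.charAt(0)));  // 20ac
    }
}
```

This is only safe when every parameter was actually posted as UTF-8; a value
that was genuinely sent as ISO-8859-1 with bytes above 0x7F would be damaged
by the second decoding step.
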
I would like to know if Resin behaves differently, and from your words I
guess it does, since:
1) you use setContentType("text/html; charset=UTF-8") without the ending
semicolon, which I haven't seen used anywhere
2) you get 1 character (I assume it's the right character)
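
The two lengths reported in the quoted message below (2 versus 1) would be
consistent with text=%C4%9B being percent-decoded with two different
charsets. The sketch below only illustrates that with java.net.URLDecoder;
it is not Resin's actual code, and the class and method names are
hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; it only shows how the charset changes the length.
class DecodeLengthDemo {

    // Percent-decode the value with the given charset and return its length.
    static int decodedLength(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset).length();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Bytes C4 9B read as two ISO-8859-1 characters.
        System.out.println(decodedLength("%C4%9B", "ISO-8859-1")); // 2
        // The same bytes read as one UTF-8 character (U+011B).
        System.out.println(decodedLength("%C4%9B", "UTF-8"));      // 1
    }
}
```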

Hope this helps,
Marco
----- Original Message -----
[cut]
> If I set res.setContentType("text/html; charset=UTF-8;"); in my servlet
> and POST data to the server, each special character (not in ISO-8859-1)
> is sent in hex as 2 bytes. For example, if I send the server a small "e"
> with stud/wedge, the server gets text=%C4%9B, which is the correct UTF-8
> value for that character in hex.
>
> And now comes the main question: what will the servlet engine do with it?
> I use Resin 1.2.5, and if I do
> String par = req.getParameter("text"); for that 1 e with stud, the par
> string will have a length() of 2, which is OK, and it will display OK on
> the page as 1 character thanks to the header.
>
> But if I set setContentType("text/html; charset=UTF-8"); without the last
> ";" after UTF-8, the par string length() is only 1 !! and it displays OK
> on the page too.
>
> Does anyone here have a working UTF-8 site? I have spent too many hours
> trying to find a good solution to this problem.
>
> (btw1.: I use the <form> as
> <form method=\"POST\" accept-charset=\"UTF-8\">)
> (btw2.: I think this mess is because Java's char has 2 bytes by default)
[cut]
