Chris-
See answers below...
Christopher Schultz wrote:
I just ran it through Wireshark and followed the TCP stream verifying
that it's being encoded correctly into 3 bytes - it is (E2 80 A2).
Is this the UTF-8 code that /should/ represent those bullet characters?
I would assume so. I had a co-worker use BBEdit on his Mac to convert a
document to and from UTF-8 and other formats, and it appears to convert
them back and forth correctly. The W3C does recommend that
enctype=multipart/form-data be "used for submitting forms that contain
files, non-ASCII data, and binary data." However, setting this and
leaving the acceptCharset and the @page directive as UTF-8 results in
Wireshark not reporting any encoding, and the bullet character shows up
as 3 . characters.
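
On the question of whether E2 80 A2 is the right UTF-8 encoding for a bullet, a quick check is to encode U+2022 (BULLET) and print the bytes. This is only a minimal sketch; the class and helper names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class BulletUtf8 {
    // Hypothetical helper: formats each byte as two uppercase hex digits,
    // separated by single spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bullet = "\u2022"; // U+2022 BULLET
        byte[] utf8 = bullet.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8)); // prints "E2 80 A2"
    }
}
```

So the three bytes seen on the wire are what a UTF-8 encoded bullet should look like.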
Regardless, if we take the string containing the 3 characters (E2 80 A2),
call charAt() on each position, place the results in a byte array, and
construct a new String from those bytes, Java correctly recognizes that
the bytes represent a single Unicode character (I believe because String
represents Unicode text by default), and the string length decreases
by 2.
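
The round trip described above can be sketched roughly as follows. Two assumptions are baked in: that the 3-character string came from the UTF-8 bytes being decoded as ISO-8859-1 somewhere along the way (not confirmed in this thread), and that the bytes are re-decoded with an explicit UTF-8 charset. The second point matters because new String(byte[]) without a charset argument uses the platform default charset, so the bare constructor only gives this result where that default happens to be UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class RedecodeSketch {
    // Hypothetical helper: narrows each char to a byte, then decodes the
    // bytes as UTF-8. Only meaningful when every char is in 0x00-0xFF.
    static String redecodeAsUtf8(String misdecoded) {
        byte[] bytes = new byte[misdecoded.length()];
        for (int i = 0; i < misdecoded.length(); i++) {
            bytes[i] = (byte) misdecoded.charAt(i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate the suspected fault (assumption): the UTF-8 bytes of a
        // bullet are decoded one byte per character as ISO-8859-1.
        byte[] wire = {(byte) 0xE2, (byte) 0x80, (byte) 0xA2};
        String misdecoded = new String(wire, StandardCharsets.ISO_8859_1);

        String fixed = redecodeAsUtf8(misdecoded);
        System.out.println(misdecoded.length());    // prints 3
        System.out.println(fixed.length());         // prints 1
        System.out.println(fixed.equals("\u2022")); // prints true
    }
}
```

The length dropping from 3 to 1 matches the "decreases by 2" observation, which is at least consistent with the bytes arriving intact and being decoded with the wrong charset.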
I'm going to follow up with Paul's post and try it on 1.3.8 and see
if I can reproduce. Basically, the behavior we're seeing is that the
3 bytes are being treated as separate characters and not as one
Unicode character.
Can you confirm that the Content-Type of the form is being submitted
with the request properly (as an HTTP header) and that the Request
object on the server-side correctly reads the Content-Type header?
Yes. Running "wget -S" prints the HTTP response headers, and we can see
the Content-Type header's charset correctly set to UTF-8.
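
Since "wget -S" shows the response headers, the Content-Type on the form submission itself may be worth checking separately, for example in the Wireshark capture. Below is a simplified sketch of pulling the charset parameter out of a Content-Type value; the helper is hypothetical, does not handle every legal header syntax, and is not how any particular container does it. Browsers often send application/x-www-form-urlencoded with no charset parameter at all, in which case the server has to pick a default.

```java
import java.util.Locale;

class ContentTypeCharset {
    // Hypothetical helper: returns the charset parameter of a Content-Type
    // header value, or null when the header carries no charset.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));          // prints UTF-8
        System.out.println(charsetOf("application/x-www-form-urlencoded")); // prints null
    }
}
```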
--adam