Re: Input from a FORM - encoding problem SOLVED

2002-02-20 Thread Nikola Milutinovic

The solution was to set the character encoding on the request object (not on the
response). Apparently, the parameters of the request are only parsed when they are
first fetched by a method call, which is a nice thing :-)
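
A minimal sketch of why the order of the calls matters, assuming the container decodes parameters lazily. LazyFormRequest is a hypothetical stand-in written for this illustration, not Tomcat's code; only the rule that setCharacterEncoding() has to come before the first getParameter() is taken from the thread and the Servlet 2.3 API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a request: the raw body is kept as sent and is
// decoded only when a parameter is first asked for.
final class LazyFormRequest {
    private final String rawBody;              // e.g. "name=%F5%FB"
    private String encoding = "ISO-8859-1";    // assumed default when nothing is set
    private Map<String, String> parameters;    // stays null until the first getParameter()

    LazyFormRequest(String rawBody) {
        this.rawBody = rawBody;
    }

    // Same idea as request.setCharacterEncoding(): it only has an effect
    // if it is called before the parameters have been parsed.
    void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }

    String getParameter(String name) {
        if (parameters == null) {
            parameters = parse();
        }
        return parameters.get(name);
    }

    private Map<String, String> parse() {
        Map<String, String> result = new HashMap<String, String>();
        try {
            for (String pair : rawBody.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue; // skip fragments without a value
                }
                result.put(URLDecoder.decode(pair.substring(0, eq), encoding),
                           URLDecoder.decode(pair.substring(eq + 1), encoding));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Unknown encoding: " + encoding, e);
        }
        return result;
    }

    public static void main(String[] args) {
        LazyFormRequest request = new LazyFormRequest("name=%F5%FB");
        request.setCharacterEncoding("ISO-8859-2"); // must precede getParameter()
        String name = request.getParameter("name");
        // Expected: o and u with double acute (U+0151, U+0171).
        System.out.println("\u0151\u0171".equals(name)); // prints true
    }
}
```

Calling setCharacterEncoding() after the first getParameter() changes nothing in this model, because the parameters have already been decoded with the default.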

Thanks to all who helped.

And, by the way, IE6 doesn't honour the enctype of the FORM; it just splashes its
default, which doesn't include encoding info.

Nix.



RE: Input from a FORM - encoding problem

2002-02-19 Thread Satoshi Okamoto

If it's a servlet, try this:

response.setContentType("text/html;charset=YOUR ENCODING TYPE");
PrintWriter out = new PrintWriter(new
OutputStreamWriter(response.getOutputStream(), "YOUR ENCODING TYPE"));
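
A small sketch of what wrapping the output stream in an OutputStreamWriter does to the bytes on the wire. The ByteArrayOutputStream stands in for response.getOutputStream(), and the class and method names are made up for this example. Note that this covers the response side only.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

final class EncodedWriterDemo {
    // Writes text through a PrintWriter bound to the given encoding and
    // returns the bytes that would be sent to the client.
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for response.getOutputStream()
        try {
            PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, encoding));
            out.print(text);
            out.flush(); // push buffered characters through the encoder
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // U+0151 (o with double acute) is one byte in ISO-8859-2, two in UTF-8.
        System.out.println(write("\u0151", "ISO-8859-2").length); // prints 1
        System.out.println(write("\u0151", "UTF-8").length);      // prints 2
    }
}
```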

--
To unsubscribe:   mailto:[EMAIL PROTECTED]
For additional commands: mailto:[EMAIL PROTECTED]
Troubles with the list: mailto:[EMAIL PROTECTED]




Re: Input from a FORM - encoding problem

2002-02-18 Thread David Cassidy

try this ...

<quote>
FORM attribute

accept-charset = charset list [CI]
This attribute specifies the list of character encodings for input data that is
accepted by the server processing this form. The value is a space- and/or
comma-delimited list of charset values. The client must interpret this list as an
exclusive-or list, i.e., the server is able to accept any single character encoding
per entity received.

The default value for this attribute is the reserved string "UNKNOWN". User agents
may interpret this value as the character encoding that was used to transmit the
document containing this FORM element.
</quote>

URL: http://www.w3.org/TR/html401/interact/forms.html#h-17.3


Let us know ...

Thanks

D




Nikola Milutinovic wrote:

> Hi all.
>
> I have an HTML FORM that I'd like to use to update data in my database. The DB
> (PostgreSQL + Unicode) is configured and correctly loaded with Unicode data.
> Translation from UTF-8 -> Win-1250 works like a charm (and so does UTF-8 ->
> ISO-8859-2).
>
> In other words, displaying the data is OK.
>
> Now I want to update fields and there is a problem. If I enter some Win-1250
> chars in a text field, they get translated to ?.
>
> A simple investigation shows that the loathed Win1250 -> '?' occurs within the
> HTTPRequest object creation.
>
> How do I specify that the data coming from a FORM is Win1250 encoded?
> Do I do that in the HTML FORM that submits the data (most likely)?
> Or do I do that in the JSP/Servlet accepting the data (highly unlikely)?
>
> I'm looking at the HTML 4.01 specification, but so far I'm unlucky - nothing
> seems to work.
>
> Nix.






RE: Input from a FORM - encoding problem

2002-02-18 Thread Arnold Shore

I'm using something like the following, which works for me with IE6 and IIS:
<FORM ACCEPT-CHARSET="UTF-8" METHOD= ...>

Arnold Shore
Annapolis, MD USA

-Original Message-
From: Nikola Milutinovic [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 18, 2002 8:45 AM
To: Tomcat Users List
Subject: Input from a FORM - encoding problem


... Do I do that in HTML FORM that submits the data (most likely)?
Or do I do that in the JSP/Servlet accepting the data (highly unlikely)?

I'm looking at the HTML 4.01 specification, but so far I'm unlucky - nothing
seems to work.

Nix.






Re: Input from a FORM - encoding problem

2002-02-18 Thread Nikola Milutinovic

> try this ...
>
> <quote>
> FORM attribute
>
> accept-charset = charset list [CI]
> This attribute specifies the list of character encodings for input data that is
> accepted by the server processing this form. The value is a space- and/or
> comma-delimited list of charset values. The client must interpret this list as an
> exclusive-or list, i.e., the server is able to accept any single character encoding
> per entity received.

Nothing yet. I've tried it. I'll try ISO-8859-2 tomorrow.

How does it work anyway? What does the HTTP request have in its headers for this encoding?

Nix.



Re: Input from a FORM - encoding problem

2002-02-18 Thread Nikola Milutinovic

> <quote>
> FORM attribute
>
> accept-charset = charset list [CI]
> This attribute specifies the list of character encodings for input data that is
> accepted by the server processing this form. The value is a space- and/or
> comma-delimited list of charset values. The client must interpret this list as an
> exclusive-or list, i.e., the server is able to accept any single character encoding
> per entity received.

This bit is a bit unclear to me. If I specify several encodings, how will the 
browser know which one was actually used? How will the server know which one was used?

Nix.



Re: Input from a FORM - encoding problem

2002-02-18 Thread Attila Szegedi

Don't bother fiddling with FORM attributes. I've done this before to no avail.

Right now, no matter what you specify as an encoding in an HTML page, most browsers
(all favorite IE and NN flavors) ignore it altogether and encode the form data using
the encoding in which the page containing the form was sent to them. Worse yet, they
*don't* specify the encoding of characters in the form data when sending them back via
a POST request, so you must know on the server side what was the encoding of the page
that contained the form. The Servlet 2.3 spec is meant to contain a solution for this,
but I don't know how it is (or isn't) implemented in Tomcat 4.x.

As if all of the above weren't enough, Tomcat 3.x gives yet another stab to
internationalization efforts: it will blindly interpret all form data as being
iso-8859-1 (~Cp1252), so your iso-8859-2 (~Cp1250) characters are lost. Again, I
don't know how the Tomcat 4.x line handles this.

Being a Hungarian, I'm just as interested in entering 8859-2 characters in my pages,
and not seeing ? marks on the server side, so I'm transcoding all form data strings on
the fly. The off-the-wall solution looks like this:

param = new String(param.getBytes("8859_1"), "8859_2");

although this tends to be slow (running through Java char-to-byte, then through
byte-to-char machinery). I have developed a fast 8859-1 to 8859-2 transcoder that
addresses speed issues; contact me in private mail and I can send it to you.
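
The transcoding one-liner above, written out as a self-contained sketch. The class and method names are hypothetical, and the canonical charset names are used in place of the 8859_1 / 8859_2 aliases. It assumes the container decoded the form bytes as ISO-8859-1 while the browser actually sent ISO-8859-2.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcode {
    private static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // Undoes a wrong ISO-8859-1 decode: recover the original bytes,
    // then decode them again with the charset the browser really used.
    static String latin1ToLatin2(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, LATIN2);
    }

    public static void main(String[] args) {
        // Bytes 0xF5 0xFB read as ISO-8859-1 give U+00F5 U+00FB (o-tilde, u-circumflex);
        // read as ISO-8859-2 they are U+0151 U+0171 (o and u with double acute).
        String fixed = latin1ToLatin2("\u00F5\u00FB");
        System.out.println("\u0151\u0171".equals(fixed)); // prints true
    }
}
```

This works only because ISO-8859-1 maps every byte value to exactly one character, so the first step gets back the bytes unchanged.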

Cheers,
  Attila.
--
Attila Szegedi
home: http://www.szegedi.org



RE: Input from a FORM - encoding problem

2002-02-18 Thread Arnold Shore

Re "Don't bother fiddling with FORM attributes. I've done this before to
no avail":

I'm accepting Arabic, Hebrew, Russian, and Chinese doing exactly that, with
IE 6 and using Unicode encodings. (Will be trying NN and Opera shortly.) And
yes, I'm also using that encoding on the page.

It's going into a database, with subsequent retrieval and display.  Works
correctly for the stuff I've tried.

Arnold Shore
Annapolis, MD USA





Re: Input from a FORM - encoding problem

2002-02-18 Thread Nikola Milutinovic

Attila Szegedi wrote:

> Don't bother fiddling with FORM attributes. I've done this before to no avail.
>
> Right now, no matter what you specify as an encoding in an HTML page, most
> browsers (all favorite IE and NN flavors) ignore it altogether and encode
> the form data using the encoding in which the page containing the form was
> sent to them. Worse yet, they *don't* specify the encoding of characters
> in the form data when sending them back via a POST request, so you must
> know on the server side what was the encoding of the page that contained
> the form. The Servlet 2.3 spec is meant to contain a solution for this, but I
> don't know how it is (or isn't) implemented in Tomcat 4.x.

And how is it supposed to be specified? HTTP headers? Which ones?

> As if all of the above weren't enough, Tomcat 3.x gives yet another stab to
> internationalization efforts: it will blindly interpret all form data as
> being iso-8859-1 (~Cp1252), so your iso-8859-2 (~Cp1250) characters are
> lost. Again, I don't know how the Tomcat 4.x line handles this.

I guess I'll have to dig into the code. (sigh) Oh well, at least I HAVE access
to the source code.

> Being a Hungarian, I'm just as interested in entering 8859-2 characters in my
> pages, and not seeing ? marks on the server side, so I'm transcoding all form
> data strings on the fly. The off-the-wall solution looks like this:
>
> param = new String(param.getBytes("8859_1"), "8859_2");

Where do you place this? Is it like:

param = request.getParameter( name );
param = new String(param.getBytes("8859_1"), "8859_2");

Basically, my question would be: once inside the JSP page, can I get parameters
and re-code them some way or are they destroyed (transfigured to those pesky
?s) upon construction of the HTTPResponse object?

> although this tends to be slow (running through Java char-to-byte, then through
> byte-to-char machinery). I have developed a fast 8859-1 to 8859-2 transcoder
> that addresses speed issues; contact me in private mail and I can send it to you.

Sure. Send it, please.

BTW, I'm using Tomcat 4.01, so, if need be, I could employ some sort of filter,
but I'd like a proper solution. Tomcat 4 is supposed to be a reference Servlet
container, after all.


Nix.






Re: Input from a FORM - encoding problem

2002-02-18 Thread Attila Szegedi

- Original Message -
From: Nikola Milutinovic [EMAIL PROTECTED]
To: Tomcat Users List [EMAIL PROTECTED]
Sent: 2002. február 18. 18:19
Subject: Re: Input from a FORM - encoding problem


> Attila Szegedi wrote:
> > Don't bother fiddling with FORM attributes. I've done this before to no avail.
> > Right now, no matter what you specify as an encoding in an HTML page, most
> > browsers (all favorite IE and NN flavors) ignore it altogether and encode
> > the form data using the encoding in which the page containing the form was
> > sent to them. Worse yet, they *don't* specify the encoding of characters
> > in the form data when sending them back via a POST request, so you must
> > know on the server side what was the encoding of the page that contained
> > the form. The Servlet 2.3 spec is meant to contain a solution for this, but I
> > don't know how it is (or isn't) implemented in Tomcat 4.x.
>
> And how is it supposed to be specified? HTTP headers? Which ones?


request.setCharacterEncoding(String encoding)

See http://www.servlets.com/soapbox/servlet23.html (Jason Hunter's article)
for more info.

> > As if all of the above weren't enough, Tomcat 3.x gives yet another stab to
> > internationalization efforts: it will blindly interpret all form data as
> > being iso-8859-1 (~Cp1252), so your iso-8859-2 (~Cp1250) characters are
> > lost. Again, I don't know how the Tomcat 4.x line handles this.
>
> I guess I'll have to dig into the code. (sigh) Oh well, at least I HAVE access
> to the source code.
>
> > Being a Hungarian, I'm just as interested in entering 8859-2 characters in my
> > pages, and not seeing ? marks on the server side, so I'm transcoding all form
> > data strings on the fly. The off-the-wall solution looks like this:
> > param = new String(param.getBytes("8859_1"), "8859_2");
>
> Where do you place this? Is it like:
>
> param = request.getParameter( name );
> param = new String(param.getBytes("8859_1"), "8859_2");
>
> Basically, my question would be: once inside the JSP page, can I get parameters
> and re-code them some way or are they destroyed (transfigured to those pesky
> ?s) upon construction of the HTTPResponse object?


There's a good chance they are not destroyed. I guess question marks are the
artifact of later transformation of the string to bytes (like when
generating a response). In the request, the byte value of your characters
should be preserved and thus transcoding should be possible.
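
A short sketch of the two claims above, using only the JDK. The class and method names are made up for the illustration. It checks that an ISO-8859-1 decode keeps every byte value recoverable, and that a question mark first appears when a character is encoded into a charset that has no mapping for it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTrip {
    // Decodes all 256 byte values as ISO-8859-1 and encodes them again;
    // returns true when the bytes come back unchanged.
    static boolean latin1IsLossless() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Encoding a character that ISO-8859-1 cannot represent yields the
    // replacement byte '?', which is where the question marks come from.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(latin1IsLossless());                    // prints true
        System.out.println((char) encodeLatin1("\u0151")[0]);      // prints ?
    }
}
```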


> > although this tends to be slow (running through Java char-to-byte, then through
> > byte-to-char machinery). I have developed a fast 8859-1 to 8859-2 transcoder
> > that addresses speed issues; contact me in private mail and I can send it to you.
>
> Sure. Send it, please.
>
> BTW, I'm using Tomcat 4.01, so, if need be, I could employ some sort of filter,
> but I'd like a proper solution. Tomcat 4 is supposed to be a reference Servlet
> container, after all.

Then try using request.setCharacterEncoding(String encoding) method before
you jump the gun and start coding the filter.

--
Attila Szegedi
home: http://www.szegedi.org







Re: Input from a FORM - encoding problem

2002-02-18 Thread Attila Szegedi

OK: he might try. I admit I've not used IE6, only IEs up to 5.5 and NN up to
4.72, but it's a fact that:

- these browsers never appended a charset declaration to the Content-Type
header (i.e. Content-Type: application/x-www-form-urlencoded and not
Content-Type: application/x-www-form-urlencoded; charset=iso-8859-2), so it was
up to the server side to figure out what the charset was.

- Tomcat 3.2.x blindly decoded form data as ISO-8859-1. In fact, the code in
the javax.servlet.http.HttpUtils#parsePostData() method contains the
following very revealing comment:
<quote>
// XXX we shouldn't assume that the only kind of POST body
// is FORM data encoded using ASCII or ISO Latin/1 ... or
// that the body should always be treated as FORM data.

</quote>
So, even if your browser acts according to the spec, Tomcat 3.2.x certainly does not.
I must underline that I don't know if 3.3.x or 4.x Tomcats rely on this
(flawed) code or not. Tomcat 4.x definitely should not, since it is supposed
to implement request.setCharacterEncoding()...
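
A minimal sketch of the server-side decision described in the first point: look for a charset parameter in the Content-Type header and fall back to a known page encoding when the browser sent none. ContentTypeCharset and its methods are hypothetical helpers for this illustration, not part of the servlet API.

```java
final class ContentTypeCharset {
    // Returns the value of the charset parameter, or null when the
    // header is missing or carries no charset.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type itself
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // strip optional quotes
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Uses the declared charset when present, otherwise the encoding the
    // server knows the form page was sent in.
    static String effectiveCharset(String contentType, String pageEncoding) {
        String declared = charsetOf(contentType);
        return declared != null ? declared : pageEncoding;
    }

    public static void main(String[] args) {
        // What the browsers in this thread sent: no charset at all.
        System.out.println(effectiveCharset("application/x-www-form-urlencoded", "ISO-8859-2"));
        // prints ISO-8859-2
    }
}
```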

Cheers,
  Attila.

--
Attila Szegedi
home: http://www.szegedi.org

