Re: Character encoding problems using jsp:include with jsp:param in Tomcat 8.5 only.

2018-11-29 Thread Christopher Schultz

Thorsten,

On 11/27/18 04:48, Thorsten Schöning wrote:
> Hello Christopher Schultz, on Monday, 26 November 2018 at 16:07 you
> wrote:
> 
>> web.xml
>> -------
>> <web-app>
>>   <request-character-encoding>UTF-8</request-character-encoding>
>> </web-app>
> 
> Tested that with Tomcat 9 and this setting fixed my problem the
> same as using SetCharacterEncodingFilter. It doesn't work in Tomcat
> 8.5, I guess because that simply doesn't implement Servlet 4.0?

Correct. Tomcat 8.0 and 8.5 implement servlet 3.1. In Tomcat 8.x,
you'll need to use the SetCharacterEncodingFilter.

> Because I still need to support Tomcat 7 and 8.0 for some time,
> I'll keep SetCharacterEncodingFilter for now and just document the
> better solution. Thanks!

Sounds good. The SetCharacterEncodingFilter should be entirely
forward-compatible.

-chris

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



Re: Character encoding problems using jsp:include with jsp:param in Tomcat 8.5 only.

2018-11-27 Thread Thorsten Schöning
Hello Christopher Schultz,
on Monday, 26 November 2018 at 16:07 you wrote:

> web.xml
> -------
> <web-app>
>   <request-character-encoding>UTF-8</request-character-encoding>
> </web-app>

Tested that with Tomcat 9 and this setting fixed my problem the same
as using SetCharacterEncodingFilter. It doesn't work in Tomcat 8.5, I
guess because that simply doesn't implement Servlet 4.0?

Because I still need to support Tomcat 7 and 8.0 for some time, I'll
keep SetCharacterEncodingFilter for now and just document the better
solution. Thanks!

P.S.:

I sent you a private mail some days ago, unrelated to Tomcat. Did
you get that? Just want to make sure that I'm not spam filtered.

Kind regards,

Thorsten Schöning

-- 
Thorsten Schöning   E-Mail: thorsten.schoen...@am-soft.de
AM-SoFT IT-Systeme  http://www.AM-SoFT.de/

Telephone...05151-  9468- 55
Fax...05151-  9468- 88
Mobile..0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Managing Director: Andreas Muchow





Re: Character encoding problems using jsp:include with jsp:param in Tomcat 8.5 only.

2018-11-26 Thread Christopher Schultz

Thorsten,

On 11/26/18 08:45, Thorsten Schöning wrote:
> Hi all,
> 
> I'm currently testing migration of a legacy web app from Tomcat 7
> to 8 to 8.5 and ran into problems regarding character encoding in
> 8.5 only. That app uses JSP pages, declares all of them to be
> stored in UTF-8 (and they really are :-)), and declares an HTTP
> Content-Type of "text/html; charset=UTF-8" as well. Textual content
> at the HTML level is properly encoded using UTF-8 and is displayed
> correctly in the browser etc.
> 
> In Tomcat 8.5 the following is introducing encoding problems,
> though:
> 
>> <jsp:include page="/WEB-INF/jsp/includes/search.jsp">
>>   <jsp:param name="chooseSearchInputTitle" value="Benutzer wählen" />
>> </jsp:include>
> 
> "search.jsp" simply outputs the value of the param as the "title" 
> attribute of some HTML-link and the character "ä" is replaced 
> somewhere with the Unicode character REPLACEMENT CHARACTER 0xFFFD.
> But really only in Tomcat 8.5, not in 8 and not in 7.

Have you been able to determine if the problem is on input or output?

> I can fix that problem using either "SetCharacterEncodingFilter"
> or the following line, which I guess has the same effect:
> 
>> <% request.setCharacterEncoding("UTF-8"); %>

FYI the SetCharacterEncodingFilter only modifies request encoding and
not response encoding. Also, it only changes the encoding of the
request *body* (e.g. PUT/POST), and not the encoding used to decode
the URI. That's configured in <Connector>'s URIEncoding. There is also
useBodyEncodingForURI which inherits the request body's encoding if
it's present. I recommend using useBodyEncodingForURI="true".

I recommend *always* using SetCharacterEncodingFilter, since web
browsers both habitually refuse to send a correct Content-Type and
often use UTF-8 in URLs in violation of the HTTP spec. The result is
essentially that everything works the way you *want* it to work,
except that you just have to "hope" it works instead of being able to
prove that it will.

> Looking at the generated Java code for the JSP I get the
> following:
> 
>> org.apache.jasper.runtime.JspRuntimeLibrary.include(request,
>>     response, "/WEB-INF/jsp/includes/search.jsp" + "?" +
>>     org.apache.jasper.runtime.JspRuntimeLibrary.URLEncode(
>>         "chooseSearchInputTitle", request.getCharacterEncoding()) + "=" +
>>     org.apache.jasper.runtime.JspRuntimeLibrary.URLEncode(
>>         "Benutzer wählen", request.getCharacterEncoding()), out, false);
> 
> The "ä" is properly encoded using UTF-8 in all versions of Tomcat
> and the generated code seems to be the same in all versions as
> well, especially regarding "request.getCharacterEncoding()".
>
> "getCharacterEncoding" in Tomcat 8.5 has changed; the former
> implementation didn't take the context into account:
> 
>> @Override
>> public String getCharacterEncoding() {
>>     String characterEncoding = coyoteRequest.getCharacterEncoding();
>>     if (characterEncoding != null) {
>>         return characterEncoding;
>>     }
>>
>>     Context context = getContext();
>>     if (context != null) {
>>         return context.getRequestCharacterEncoding();
>>     }
>>
>>     return null;
>> }

This is just a fall-back for when there is no character encoding
defined in the request (because the browser didn't send one).

> My connector in server.xml is configured to use "URIEncoding" as
> UTF-8 in all versions of Tomcat, but that doesn't make a difference
> to 8.5. So I understand that using "setCharacterEncoding", I set
> the value actually used in the generated Java now, even though the
> following is documented for character encoding filter:
> 
>> Note that the encoding for GET requests is not set here, but on a
>> Connector
> 
> https://tomcat.apache.org/tomcat-8.5-doc/config/filter.html#Set_Character_Encoding_Filter/Introduction
>
> Now I'm wondering about multiple things...
> 
> 1. Doesn't "getCharacterEncoding" provide the encoding of the 
> HTTP-body?

Yes, but it comes directly from the browser, which often doesn't provide
it. There is no encoding detection going on, so it's often "null" or
ISO-8859-1, which is the spec-defined default.

> My JSP is called using GET and the Java quoted above seems to build
> a query string as well. So why does it depend on some body encoding
> instead of e.g. URIEncoding of the connector?

Good question. Might be a bug, here.
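
A minimal sketch of how a REPLACEMENT CHARACTER (U+FFFD) can arise when a parameter value is percent-encoded with one charset and decoded with another. It uses only java.net.URLEncoder and java.net.URLDecoder; the class IncludeParamEncodingDemo and its methods are hypothetical and are not part of Tomcat or Jasper. It does not claim to reproduce what Tomcat 8.5 does internally, only the general mismatch being discussed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; not Tomcat or Jasper code.
final class IncludeParamEncodingDemo {

    // Percent-encode a value using the named charset.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Percent-decode a value using the named charset.
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String value = "Benutzer w\u00E4hlen";

        // Same charset on both sides: the value survives.
        String utf8 = encode(value, "UTF-8");
        System.out.println(utf8 + " -> " + decode(utf8, "UTF-8"));

        // Encoded as ISO-8859-1 but decoded as UTF-8: the lone byte 0xE4
        // is not valid UTF-8, so the decoder substitutes U+FFFD.
        String latin1 = encode(value, "ISO-8859-1");
        System.out.println(latin1 + " -> " + decode(latin1, "UTF-8"));
    }
}
```

If the encoding side falls back to ISO-8859-1 because getCharacterEncoding() returned null while the decoding side uses UTF-8, the result would look like the second case.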

> 2. Is my former approach wrong or did changes in Tomcat 8.5
> introduce some regression? There is some conversion somewhere which
> was not present in the past.

Tomcat 8.5 follows the servlet spec, which in v4.0 added the
<request-character-encoding> element to make things even more fun.
Actually, this can replace the use of the SetCharacterEncodingFilter.
Thanks for pointing this out; I wasn't aware of this feature of the
4.0 spec.

> 3. What is the correct fix I need now? The character encoding
> filter, even though it only applies to bodies per documentation?

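Try setting <request-character-encoding> in your <web-app> like this:

web.xml
-------
<web-app>
  <request-character-encoding>UTF-8</request-character-encoding>
</web-app>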

-chris


Re: Character encoding issue in URL

2017-01-25 Thread Christopher Schultz

Justin,

On 1/25/17 12:25 AM, Justin Dang wrote:
> Hi, I have a clean install of an older version of Tomcat (8.0.24).
> I have noticed when a character is encoded in the URL, Tomcat fails
> to return the URL requested.  I've noted this same request
> performed in IIS works fine.
> 
> Apache Tomcat tests:
> 
> Works (No escape) –  http://localhost:8080/examples/delme/íj.pdf
> 
> Works (URL encoded) –
> http://localhost:8080/examples/delme/%C3%ADj.pdf
> 
> Failed (Char encoded) –
> http://localhost:8080/examples/delme/%EDj.pdf

You are mistaken. While you might think that %ED would map to U+00ED,
it does not. You need to use %C3%AD as you have done in your second
example, because the standard is UTF-8 and not UTF-16 or anything like
that.

%ED on its own is not a valid byte sequence in UTF-8: 0xED is the lead
byte of a three-byte sequence, and "j" is not a continuation byte.

https://en.wikipedia.org/wiki/UTF-8#Description
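
A small sketch that checks the two URLs from the question against a strict UTF-8 decoder. It assumes only the JDK; the class Utf8PathCheck and its helper methods are hypothetical names, and the percent-decoder is deliberately simplistic (literal characters are assumed to be ASCII). It illustrates why %C3%AD is accepted and %ED is not, and is not a description of Tomcat's own URI decoding code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not Tomcat code.
final class Utf8PathCheck {

    // Turn %XX escapes into raw bytes; other characters are assumed to be ASCII.
    static byte[] percentDecode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Strict check: report malformed input instead of silently replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(percentDecode("%C3%ADj.pdf")));
        System.out.println(isValidUtf8(percentDecode("%EDj.pdf")));
        // The single byte 0xED only means U+00ED in ISO-8859-1.
        System.out.println(new String(percentDecode("%EDj.pdf"), StandardCharsets.ISO_8859_1));
    }
}
```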

> IIS tests:
> 
> Works (No escape)  –  http://localhost:8080/examples/delme/íj.pdf
> 
> Works (URL encoded) –
> http://localhost:8080/examples/delme/%C3%ADj.pdf
> 
> Works (Char encoded) –
> http://localhost:8080/examples/delme/%EDj.pdf
> 
> 
> I've reviewed this wiki page:
> 
> 
> https://wiki.apache.org/tomcat/FAQ/CharacterEncoding
> 
> 
> And it seems to imply that I shouldn't have to do anything, and the
> URL request should return properly.
> 
> 
> So my question is, what do I need to configure in Apache Tomcat to
> handle the character encoding request like IIS does?

Only by violating UTF-8.

-chris



Re: Character encoding issues

2016-08-24 Thread Christopher Schultz

James,

On 8/24/16 3:46 PM, James H. H. Lampert wrote:
> On 8/24/16, 12:36 PM, Mark Thomas wrote:
> 
>> At a guess, something in the web application is using the
>> platform default encoding rather than an explicit encoding. Given
>> that the Linux box is OK, it looks like the app should be
>> explicitly using UTF-8 everywhere.
> 
> Based on a response I got on the Midrange Java List, and on what
> I'd found since I entered the query, I would agree.
> 
> What's the best way to accomplish this?

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8

If you also want to change the default platform encoding, you'll have
to do the following (or equivalent on your E4A):

1. Edit CATALINA_BASE/bin/setenv.sh (or create it)
2. Add this line:

   export CATALINA_OPTS="$CATALINA_OPTS -Dfile.encoding=UTF-8"

The above should change the values on the E4A's JVM so that everything
reports UTF-8 and nothing reports ISO-8859-1. The "UnicodeBig" is the
internal encoding that the JVM uses to represent Java primitive "char"
type and will probably only work with the value you've got there...
don't try to mess with that! :)

FWIW, the system properties of each JVM are somewhat interesting, but
probably won't help you debug anything. It might not even fix anything.

Every HTTP request/response is defined to have a character encoding:

1. As specified in the Content-Type header
2. To be ISO-8859-1 as the default in case no header exists

"Most" client software these days actually defaults to sending
requests without a Content-Type character encoding and instead just
using  "whatever character encoding the server sent this page to me
using" as the character encoding. That's usually not the case with
back-end software, which is likely the case with your app.

Presumably, your application uses IMAP or similar to contact the gmail
server? In that case, HTTP isn't in use and it's possible that the
system properties defining the system character encoding are in use.
It all depends upon how the software works under the hood.

If HTTP is in use, here, then the problem exists in some component not
following the spec. Tomcat isn't part of the problem, there. If
some other protocol is in use, it's entirely possible that default
"platform" (as defined by system properties) encoding is being used.
The only solution there would be to change the file.encoding property
as I've described above.
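
As an illustration of the "explicit encoding everywhere" advice, here is a minimal sketch that writes and reads text through streams with a named charset instead of relying on the platform default. The class ExplicitCharsetDemo and its methods are hypothetical, and the sample string is arbitrary; nothing here is taken from the application under discussion.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class ExplicitCharsetDemo {

    // Encode text through a Writer that is given its charset explicitly.
    static byte[] toBytes(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode bytes through a Reader that is given its charset explicitly.
    static String fromBytes(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // What a charset-less Reader or Writer would silently use on this JVM.
        System.out.println("platform default: " + Charset.defaultCharset());

        String text = "Gr\u00FC\u00DFe";
        byte[] utf8 = toBytes(text, StandardCharsets.UTF_8);
        // Matching charsets round-trip; mismatched charsets do not.
        System.out.println(fromBytes(utf8, StandardCharsets.UTF_8).equals(text));
        System.out.println(fromBytes(utf8, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

Code written this way behaves the same on a JVM whose file.encoding is ISO8859_1 and on one where it is UTF-8, which is the point of not depending on the default.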

Let us know how it goes.

-chris



Re: Character encoding issues

2016-08-24 Thread James H. H. Lampert

On 8/24/16, 12:36 PM, Mark Thomas wrote:

> At a guess, something in the web application is using the platform
> default encoding rather than an explicit encoding. Given that the Linux
> box is OK, it looks like the app should be explicitly using UTF-8
> everywhere.


Based on a response I got on the Midrange Java List, and on what I'd 
found since I entered the query, I would agree.


What's the best way to accomplish this?

--
JHHL




Re: Character encoding issues

2016-08-24 Thread Mark Thomas
On 24/08/2016 17:43, James H. H. Lampert wrote:
> Ladies and Gentlemen of the Tomcat and Midrange-Java communities:
> 
> We're having a weird problem with character encoding in a Tomcat webapp.
> 
> We've added an interface to GMail to our webapp, and we've got, just for
> our own development, testing, and production use, instances of that
> webapp running in three different Tomcat 7 servers: one running on an
> IBM Midrange box (an E4A, running V6R1, with Tomcat running on a Java 6
> JVM), one running on a Windows box, and the third running on a Linux
> (CentOS) box.
> 
> On the Midrange box, the traffic between us and GMail is getting garbled
> (Chinese characters appear, seemingly at random), with an apparent
> character encoding conflict. On the Linux box, it isn't. Not sure about
> the Windows box.
> 
> Now, on the Midrange box, it's a fairly straightforward process to look
> at the Java System Properties for a JVM. For the JVM Tomcat is running
> in, "Initial Java System Properties" shows
>> file.encoding            'ISO8859_1'
> and "Current Java System Properties" shows
>> os.encoding              'ISO8859-1'
>> sun.jnu.encoding         'ISO8859-1'
>> sun.io.unicode.encoding  'UnicodeBig'
>> ibm.system.encoding      'ISO8859-1'
>> file.encoding            'ISO8859_1'
> 
> I found JConsole and JVisualVM on the Linux box, and while I couldn't
> find system properties in JConsole, I could in JVisualVM. I have:
>> file.encoding=UTF-8
>> sun.jnu.encoding=UTF-8
>> sun.io.unicode.encoding=UnicodeLittle
> 
> Can somebody enlighten me on whether this is the cause of the encoding
> issue with Google, and what to do about it?

At a guess, something in the web application is using the platform
default encoding rather than an explicit encoding. Given that the Linux
box is OK, it looks like the app should be explicitly using UTF-8
everywhere.

Mark





Re: Character encoding question

2010-08-29 Thread Dave Cherkassky

The Microsoft characters are encoded in CP-1252 
(http://en.wikipedia.org/wiki/Windows-1252).

However, if the problem is the database-driven content, then you also need to consider 
the encoding that the dB is using.  For example, MySQL might by default use latin1 
(ISO-8859-1), so your data might be corrupted before it is even seen by 
Tomcat.   So be careful -- you might have to go deeper than just the Tomcat encoding.


BTW, in our application we solved the same problem by catching the data before 
it is saved to the dB, converting any CP-1252 characters into reasonable 
latin-1 characters.  It is not the perfect solution, but is enough for our 
clients.
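
A rough sketch of the kind of pre-save conversion described above, assuming the troublesome characters are the usual "smart" punctuation marks. The class SmartPunctuationDemo and the particular substitution table are illustrative assumptions, not the code used in our application. It also shows where the question marks come from: Java substitutes "?" when a character cannot be represented in ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the substitution table is an illustrative choice.
final class SmartPunctuationDemo {

    // Replace common CP-1252 "smart" punctuation with Latin-1-safe equivalents.
    static String toLatin1Friendly(String s) {
        return s.replace('\u2018', '\'')   // left single quote
                .replace('\u2019', '\'')   // right single quote / apostrophe
                .replace('\u201C', '"')    // left double quote
                .replace('\u201D', '"')    // right double quote
                .replace("\u2013", "-")    // en dash
                .replace("\u2014", "-")    // em dash
                .replace("\u2026", "..."); // ellipsis
    }

    // Simulate a trip through an ISO-8859-1 channel.
    static String throughLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "It\u2019s";
        // Unconverted: the apostrophe is unmappable and becomes a question mark.
        System.out.println(throughLatin1(original));
        // Converted first: a plain ASCII apostrophe survives.
        System.out.println(throughLatin1(toLatin1Friendly(original)));
    }
}
```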

Good luck,
--
Dave Cherkassky
  VP of Software Development
  DJiNN Software Inc.
  416.504.1354

On 27/08/2010 1:23 PM, laredotornado wrote:

> Hi,
>
> I'm using Tomcat 6.0.26.  I'm noticing that when our JSP pages are served,
> we frequently have ?s where apostrophes should be.  We think this is
> because the database-driven content contains the Microsoft style apostrophe.
>
> My question is, if I adjust the character encoding on Tomcat, will it serve
> the MS character instead of a question mark?  I read the default encoding is
> ISO-8859-1, which I thought would include this mystery character, but
> apparently it doesn't.  Do you know what encoding I should use and where I
> should set it?
>
> Thanks, - Dave





Re: Character encoding question

2010-08-27 Thread Pid
On 27/08/2010 18:23, laredotornado wrote:
 
> Hi,
>
> I'm using Tomcat 6.0.26.  I'm noticing that when our JSP pages are served,
> we frequently have ?s where apostrophes should be.  We think this is
> because the database-driven content contains the Microsoft style apostrophe.

[wince]

> My question is, if I adjust the character encoding on Tomcat, will it serve
> the MS character instead of a question mark?  I read the default encoding is
> ISO-8859-1, which I thought would include this mystery character, but
> apparently it doesn't.  Do you know what encoding I should use and where I
> should set it?

Depends.  What encoding does the DB use?  What kind of DB is it?


p



Re: Character encoding for POST x-www-form-urlencoding (a success story)

2010-02-12 Thread Xie Xiaodong
Very nice work. Thank you for sharing.



On Fri, Feb 12, 2010 at 11:23 PM, Christopher Schultz 
ch...@christopherschultz.net wrote:


 All,

 My company recently decided to alter our password complexity
 requirements for our webapp, and I got to implement the changes. What fun!

 We use a regular expression to enforce our password complexity, and it
 needed to be changed. Since we are starting to branch-out into
 populations that aren't necessarily using written English everywhere, I
 chose to change our naive [a-z]- and [A-Z]-type checking to a more
 enlightened \p{Ll} and \p{Lu}, respectively. (Readers' note: jakarta-oro
 does not support this notation, so you'll want to use Java's built-in
 regular expression support to do this).
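
 A minimal sketch of the difference between the ASCII ranges and the Unicode categories, using java.util.regex. The class PasswordClassDemo and its method names are hypothetical; this is not the actual password rule described in this message, only a demonstration that \p{Ll} and \p{Lu} recognise letters outside a-z and A-Z.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not the password rule from the message.
final class PasswordClassDemo {

    private static final Pattern ASCII_LOWER = Pattern.compile("[a-z]");
    private static final Pattern UNICODE_LOWER = Pattern.compile("\\p{Ll}");
    private static final Pattern UNICODE_UPPER = Pattern.compile("\\p{Lu}");

    // True if the text contains at least one ASCII lowercase letter.
    static boolean hasAsciiLower(String s) {
        return ASCII_LOWER.matcher(s).find();
    }

    // True if the text contains at least one lowercase letter in any script.
    static boolean hasUnicodeLower(String s) {
        return UNICODE_LOWER.matcher(s).find();
    }

    // True if the text contains at least one uppercase letter in any script.
    static boolean hasUnicodeUpper(String s) {
        return UNICODE_UPPER.matcher(s).find();
    }

    public static void main(String[] args) {
        // A digit followed by Greek small letter pi.
        String candidate = "1\u03C0\u03C0\u03C0";
        System.out.println(hasAsciiLower(candidate));
        System.out.println(hasUnicodeLower(candidate));
        // Greek capital letter pi counts as uppercase.
        System.out.println(hasUnicodeUpper("\u03A0"));
    }
}
```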

 Anyhow, when making changes to things security-related, it pays to test
 /everything/, so I grabbed 4 other people from my group and had them
 each test 15 sample passwords against our 6 different forms that accept
 password-change entry. Everything went fine.

 Except when I then tried to log in from our home page with the password
 1πππππππ (that's a '1' digit followed by 7 Greek Pi characters, in
 case your email reader can't render that), and I got a failure. I
 figured I must have fat-fingered something, so I tried again and all was
 well.

 My spidey-sense tingling, I logged-out and repeated the process: again,
 my first login attempt was unsuccessful, while the second was. Hmm. Upon
 closer inspection, our opening page is a static HTML file served by
 Apache httpd -- no Tomcat involvement. After a failed login, a page that
 looks exactly like the home page is sent to the user, but it's
 different: /and/ it's served by Tomcat.

 The difference was that the original request's response (for
 /index.html) had a Content-Type of text/html, while the failed login
 had a response Content-Type of text/html; charset=UTF-8.

 It's our old pal "what's the default encoding, again?" coming back to
 haunt me, and here I am telling people on this list that they just don't
 understand the history of the web and how to do things properly.
 Evidently, I wasn't doing them properly, either.

 All those complaints about the way that URL-encoded GET parameters can
 get messed up based upon Content-Type and encoding guesses, etc. and "the
 solution is just to use POST" is, well, only half the truth. Yes, POST
 gets you away from the browser's preference for what encoding to use
 before URL-encoding the bytes, but, with POST the Content-Type is
 application/x-www-form-urlencoded, which means there's no charset
 associated with it. :(

 So, what's to be done?

 Well, I immediately thought of two solutions:

 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
 and
 <form accept-charset="UTF-8">

 Knowing that web browsers are notoriously inconsistent with one another
 regarding certain things, I was sure that I'd have a giant mess when it
 came to testing, and that I'd have to figure out how to trick each
 version of each browser into doing my bidding.

 First, I had to make sure that they all /failed/ in the same way (that
 is to say, that the login failed the way I expected it to fail), then I
 had to see what magical incantations would be necessary to actually get
 the login to succeed.

 I'm happy to report that, for /all/ of the following browsers, */both/*
 solutions worked!

 Mozilla Firefox 2.0
 Mozilla Firefox 3.0
 Mozilla Firefox 3.5
 Mozilla Firefox 3.6
 Opera 9.6
 Opera 10.10
 Apple Safari 3.2
 Apple Safari 4.0
 Google Chrome 4.0
 MSIE 6.0
 MSIE 7.0
 MSIE 8.0

 I'm inclined to use the <form accept-charset="UTF-8"> solution, because
 that does not involve lying to the browser about the encoding of the
 actual HTML document. Instead, I'd rather advertise that I will only
 accept UTF-8 encoding and leave it at that. Sadly, the client still
 doesn't tell me that the underlying encoding being used to urlencode the
 POST parameters is UTF-8, but at least they're doing what I want them to
 do, and they all agree on behavior!

 So, score 1 for standards, at least in this instance.

 -chris




-- 
Sincerely yours and Best Regards,
Xie Xiaodong


Re: Character encoding

2008-06-19 Thread nch

Chris, I finally found it.
My server.xml was not correctly configured. My fault.

Again, thank you all for your help.



- Original Message 
From: Christopher Schultz [EMAIL PROTECTED]
To: Tomcat Users List users@tomcat.apache.org
Sent: Wednesday, June 18, 2008 11:12:45 PM
Subject: Re: Character encoding


nch,

nch wrote:
| You say:
| Tomcat does not use any environment variables. The only settings that
| affect the interpretation of the URI are the URIEncoding and
| useBody... settings on the Connector. Are you using more than one
| connector? Are you using Apache httpd out in front of Tomcat?
|
| Perhaps the JVM does and so tomcat read them indirectly through it??

You can read the code for the connector. Those settings are the only
relevant ones.

-chris

-
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  

Re: Character encoding

2008-06-19 Thread Christopher Schultz


nch,

nch wrote:
| Chris, I finally found it.
| My server.xml was not correctly configured. My fault.
|
| Again, thank you all for your help.

No problem. Would you mind explaining for the group what the actual
problem was, and what the solution was?

Lots of these threads go nowhere because either the people asking
questions go away entirely, or they say "works, now!" and nobody reading
the archives has any clue where they should look (in spite of the
repeated answers they get from folks like me).

Thanks,
-chris




Re: Character encoding

2008-06-18 Thread Christopher Schultz


nch,

nch wrote:
| I'm having difficulties trying to decode URI parameters into UTF8.

:(

| When I moved the application
| to linux (debian etch) I found out it was not working.

We run on Linux as well. TC 5.5.23, Java 1.5.0_11. We have configured
the following:

1. Set URIEncoding="UTF-8" on our <Connector>
   (but /not/ useBodyEncodingForURI)

2. Installed a filter similar to the one you mentioned

3. Output encoding on every page is set to UTF-8

This appears to work for us (we tried several Greek characters and they
went into our database and came back out correctly).

Try removing the useBodyEncodingForURI setting.

-chris



Re: Character encoding

2008-06-18 Thread nch

Thanks, Christopher.
This doesn't work either.

I removed the useBodyEncodingForURI property, as you suggested, from the Connector 
element, but the URI parameter coming in the request is still being decoded 
as ISO-8859-1 instead of UTF-8. Pages are displaying correctly; I use 
pageEncoding="UTF-8" contentType="text/html;charset=utf-8" in every single page.
I also tried changing my system locale to es_ES.UTF-8 (it was en_US.UTF-8) by 
following http://people.debian.org/~schultmc/locales.html , but I can see no 
difference after restarting everything.
Remember, I'm having this problem in Debian etch (works fine in Windows XP).

Many thanks.




  

Re: Character encoding

2008-06-18 Thread André Warnier



nch wrote:

Thanks, Christopher.
This doesn't work either.


Could you give an example of such a UTF-8 encoded URI ?
(and tell us what it should be decoded to)
Thanks





Re: Character encoding

2008-06-18 Thread nch
There it goes.
I have a form that has an input field named "query". I type "piraña" and submit 
the form using the GET method.
I can see the browser has encoded this parameter into the URI as 
query=pira%C3%B1a
I set a breakpoint in the filter, so when the request hits the filter I can 
see that getCharacterEncoding() returns null. The filter sets it to UTF-8.
Then the request gets to the controller, where I can see the request parameter 
"query" is set to "piraÃ±a". 
The controller tries to perform a text search using that query but, obviously, 
it doesn't return any results. I can manually modify it while debugging and set 
it to "piraña", so the controller returns several results.
BTW, I'm running Tomcat 6.0.13 on Sun JDK 1.6.0_06.
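
A short sketch of the symptom, assuming the garbled value is what results when the UTF-8 bytes for "ñ" (0xC3 0xB1) are decoded as ISO-8859-1. The class MojibakeDemo and its methods are hypothetical. The repair step is useful as a diagnostic while debugging; it is not meant as a substitute for configuring the connector correctly.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class MojibakeDemo {

    // Reproduce the symptom: UTF-8 bytes read back as ISO-8859-1.
    static String misdecode(String correct) {
        return new String(correct.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo it: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "pira\u00F1a";
        String seen = misdecode(typed);
        // Six characters instead of five: one letter became two.
        System.out.println(seen.length());
        // The round trip restores the value the user typed.
        System.out.println(repair(seen).equals(typed));
    }
}
```

If repair() turns the value seen in the controller back into the expected text, the query string is most likely being decoded as ISO-8859-1 somewhere before the application reads it.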

Kind regards.




  

Re: Character encoding

2008-06-18 Thread Johnny Kewl


- Original Message - 
From: André Warnier [EMAIL PROTECTED]



Could you give an example of such a UTF-8 encoded URI ?
(and tell us what it should be decoded to)
Thanks


Andre, have a look here... it's not URL encoding, that's something different.
It's about being able to store Japanese and typically trying to match it all 
with the DB's encoding.


Here's someone's explanation of encoding history and why UTF-8 is a good 
thing...


http://www.joelonsoftware.com/articles/Unicode.html

And here is a typical solution on TC's wiki

http://wiki.apache.org/tomcat/Tomcat/UTF-8

And in the real world it gets hectic ;)

Like in NetBeans, if you don't put this in the opts
-Dfile.encoding=UTF-8
you're not seeing Japanese in your editor... and it won't save the files as 
UTF-8.


Then you think cool... until you find out you can stand on your head and a 
property file will not encode in UTF-8...


Then you may have some lib that converts back to ASCII and you can't figure 
it out...
It's a headache... but necessary... Java actually does a pretty good job of 
things in String just by default, but if you look at all the options you're 
going to find the whole encoding thing going on there as well.


Then you try just the <%@ page contentType %> directive and it's perfect; next 
project, for some unknown reason, you've got to do the old-fashioned meta tag 
in the web pages as well.


... fun stuff ;)






Re: Character encoding

2008-06-18 Thread Johnny Kewl


- Original Message - 
From: nch [EMAIL PROTECTED]

To: Tomcat Users List users@tomcat.apache.org
Sent: Wednesday, June 18, 2008 5:09 PM
Subject: Re: Character encoding


There it goes.
I have a form that has an input field named query. I type piraña and 
submit the form using the GET method.
I can see the browser has encoded this parameter into the URI as 
query=pira%C3%B1a
I set a breakpoint in the filter, so when the request hits the filter I can 
see that getCharacterEncoding() returns null. The filter sets it to UTF-8.
Then the request gets to the controller, where I can see the request 
parameter query is set to piraÃ±a.
The controller tries to perform a text search using that query but, 
obviously, it doesn't return any results. I can manually modify it while 
debugging and set it to piraña, so the controller returns several results.

BTW. I'm running Tomcat 6.0.13 on Sun JDK 1.6.0_06

Kind regards.

nch, I think the HTML page doesn't know its charset... it doesn't look like 
it's encoded.

Have a look at this article... they're doing almost what you're doing
http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/

I think your software is reading the request back as UTF-8, but 
the page that went out to the browser is not telling the browser that the form 
must come back encoded.
It looks just like normal URL encoding, there is no UTF-8 in there... I 
think.


Good luck... 






Re: Character encoding

2008-06-18 Thread Johnny Kewl



nch, I checked it... I was wrong, the browser is returning the right 
thing... that is, UTF-8,
but that display of piraÃ±a is still ISO... i.e. ISO trying to display the 
UTF-8.

So it's being read wrong in the server... sorry.
If the IDE is not set up for UTF-8... then the display is right, NB just can't 
show it to you until it can also read UTF-8... good luck ;)  Maybe it's just 
your eyes that are broken, and TC is working ;)
Send it back to the browser... it will probably be right... in which case 
it's the IDE ;)


Good luck ;) 






Re: Character encoding

2008-06-18 Thread Christopher Schultz

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

nch,

nch wrote:
| This doesn't work either.

:(

| I removed the useBodyEncoding property, as you suggested, from the
| Connector element, but the URI parameter coming in the request is
| still being decoded into ISO-8859-1 instead of UTF-8.

How do you know that ISO-8859-1 is being used to decode it?

| Pages are
| displaying correctly, I use pageEncoding="UTF-8"
| contentType="text/html;charset=utf-8" in every single page. I also
| tried changing my system locale to es_ES.UTF-8 (it was en_US.UTF-8)
| by following http://people.debian.org/~schultmc/locales.html , but I
| can see no difference after restarting everything. Remember, I'm
| having this problem in debian etch (works fine in windows xp).

We don't bother explicitly setting the JVM's locale or anything like
that. The standard environment for my production system shows
file.encoding=UTF-8 with no additional configuration. That should not
affect the interpretation of URI parameters, though.

Are you sure that your configuration is being read properly? That
everything is spelled correctly? That you are actually putting
server.xml in the right place and that TC is properly reading it? I'm
asking because the change you made should definitely have worked.

- -chris

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhZRBsACgkQ9CaO5/Lv0PBYHQCcCJzA1/JhwDD9XtWG4ilBK7Z5
/IoAoLsCGbi+Vw6jA/Ycc0elpb9tZrlN
=7mCa
-END PGP SIGNATURE-




Re: Character encoding

2008-06-18 Thread Christopher Schultz

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

nch,

nch wrote:
| I have a form that has an input field named query. I type piraña
| and submit the form using the GET method. I can see the browser has
| encoded this parameter into the URI as query=pira%C3%B1a

Is this a correct UTF-8 encoding of the parameter? I don't have my
unicode conversion chart handy right now.
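
One way to check this without a conversion chart is to let Java do the
encoding. This is only a minimal sketch; the class and method names
(EncodeCheck, encodeUtf8) are made up for illustration and do not come from
the thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Tomcat or of the poster's application.
final class EncodeCheck {

    // Percent-encodes a value using UTF-8, as a browser would for a UTF-8 page.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00F1 is the letter n with tilde; the escape sidesteps any
        // doubt about the encoding of the source file itself.
        System.out.println(encodeUtf8("pira\u00F1a")); // prints pira%C3%B1a
    }
}
```

U+00F1 is the two bytes C3 B1 in UTF-8, so pira%C3%B1a is what a UTF-8
encoding of the parameter should look like.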

| I set a breakpoint

Stop right there. If you are executing TC through a debugger, are you
sure that it is using its standard server.xml configuration?

| into the filter so when the request hits the filter I can see
| getCharacterEncoding() returns null. The filter sets it to UTF-8.

FYI, this has no bearing on the interpretation of the URI.

| Then the request gets to the controller where I can see the request
| parameter query is set to piraÃ±a.

Just in case it doesn't go through email very well, I see pir followed
by an A with a tilde over it, followed by a +/- symbol, followed by an
a. Definitely not right. Is that what you'd expect if you improperly
interpreted the UTF-8, URL-encoded piraña as if it were ISO-8859-1?
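
That question can be answered with a few lines of Java. The sketch below
decodes the same percent-encoded value twice, once as ISO-8859-1 and once as
UTF-8. The class and method names (DecodeCompare, decode) are illustrative
assumptions, not anything from the poster's code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper used only to compare the two interpretations.
final class DecodeCompare {

    // Decodes a percent-encoded value using the named charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "pira%C3%B1a";
        // Bytes C3 and B1 read as two separate Latin-1 characters:
        // A-with-tilde (U+00C3) and plus-minus (U+00B1).
        System.out.println(decode(raw, "ISO-8859-1"));
        // The same two bytes read as one UTF-8 sequence: n-with-tilde (U+00F1).
        System.out.println(decode(raw, "UTF-8"));
    }
}
```

So yes: an A with a tilde followed by a +/- sign is what the bytes C3 B1 look
like when they are read as ISO-8859-1 instead of UTF-8.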

- -chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhZRO8ACgkQ9CaO5/Lv0PBXBQCeP3YKqnpJDO65N8lfvO9ThPhr
Nr8AnRbPC1BxIEOXqIOrMCS1ACy7YFU6
=y8/w
-END PGP SIGNATURE-




Re: Character encoding

2008-06-18 Thread nch

More info on this:

- I do remote debugging through Eclipse to both tomcat on windows (same machine 
as eclipse, though) and tomcat on debian.

- I open a debugging port on tomcat by setting CATALINA_OPTS="-Xmx1024m -Xdebug 
-Xnoagent -Djava.compiler=NONE 
-Xrunjdwp:transport=dt_socket,address=4501,server=y,suspend=n"

- When I send piraña it is always encoded into the URL as pira%C3%B1a, 
whether running tomcat on windows, debian or even running my app in Jetty.

- When I send piraña, if I'm debugging tomcat on windows I can read piraña.

- If tomcat is running on debian, I read piraÃ±a.

- If I type piraña on http://www.us-webmasters.com/Decode-URLs/ and switch 
browser encoding display between ISO-8859-1 and UTF-8, I can see that when 
ISO-8859-1, then it displays piraÃ±a, when UTF-8, it displays piraña.

- When I run/debug my app on Jetty I get piraña (I've read on the web that 
Jetty decodes to UTF-8 by default).

- Something could be wrong in my debian environment. How can I find out 
which env. variables tomcat is using?

- If I try to manually decode the returned parameter in my controller
by using URLDecoder.decode(query, "UTF-8") then I can see no
difference. That is, when debugging the tomcat on windows the result is
piraña while debugging the one on debian the result is piraÃ±a.

- Is URLDecoder#decode environment dependent?

Hope this is useful. Lots of thanks to you all.




  

Re: Character encoding

2008-06-18 Thread Christopher Schultz

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

nch,

nch wrote:
| - I do remote debugging through Eclipse to both tomcat on windows
| (same machine as eclipse, though) and tomcat on debian.

Okay, remote debugging should not affect the server, but I'm still
wondering if the server.xml you think you are using is the one actually
being used. Try setting the Connector port to something crazy like
12345 and restarting. If you can still contact the server, then you are
either editing the wrong server.xml (there should only be one!) or your
changes are not being picked up.

| - When I send piraña it is always encoded into the URL as
| pira%C3%B1a, whether running tomcat on windows, debian or even
| running my app into Jetty.

That's because your browser is encoding it, not the server. So, it
doesn't depend on the server configuration (except possibly for the page
encoding, which often directs the browser to use utf-8 URI encoding).

| - If I type piraña on http://www.us-webmasters.com/Decode-URLs/ and
| switch browser encoding display between ISO-8859-1 and UTF-8, I can
| see that when ISO-8859-1, then it displays piraÃ±a, when UTF-8, it
| displays piraña.

I'm not sure what you think you're doing, there. When I paste that word
into the box to decode, I get broken output. There is no indication as
to what encoding the server expects for URIs.

Switching browser interpretation of the resulting page does not seem to
prove anything. The server never advertises any encoding to use, so the
browser just chooses whatever it wants. My browser chooses ISO-8859-1.
When I switch it to UTF-8, I see the expected interpretation. I'm not
sure what I just learned.

| - Something could be wrong in my debian environment. How can I find
| out which env. variables tomcat is using?

Tomcat does not use any environment variables. The only settings that
affect the interpretation of the URI are the URIEncoding and
useBody... settings on the Connector. Are you using more than one
connector? Are you using Apache httpd out in front of Tomcat?

| - If I try to manually decode the returned parameter into my
| controller by using URLDecoder.decode(query, "UTF-8") then I can see
| no difference. That is, when debugging the tomcat on windows the
| result is piraña while debugging the one on debian the result is
| piraÃ±a.

So, running this:

URLDecoder.decode(URLEncoder.encode("piraña", "UTF-8"), "UTF-8");

...gives you piraÃ±a on your debian system? That doesn't seem right.

| - Is URLDecoder#decode environment dependent?

Nope. As long as you always provide the encoding to be used, you should
be fine.
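
A small sketch of that round trip, assuming nothing beyond the JDK (the class
and method names RoundTrip and roundTrip are hypothetical). It also shows why
the test cannot detect a value that was already decoded wrongly before it
reached the controller: any string survives an encode/decode pair that uses
the same charset on both sides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper illustrating the encode/decode round trip.
final class RoundTrip {

    // Encodes and immediately decodes a value with the same explicit charset.
    static String roundTrip(String value, String charsetName) {
        try {
            String encoded = URLEncoder.encode(value, charsetName);
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // A correctly decoded value comes back unchanged.
        System.out.println(roundTrip("pira\u00F1a", "UTF-8"));
        // A value that was already mis-decoded upstream also comes back
        // unchanged, so the round trip does not repair it.
        System.out.println(roundTrip("pira\u00C3\u00B1a", "UTF-8"));
    }
}
```

Because the charset is passed explicitly, the result does not depend on
file.encoding or the system locale.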

- -chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEUEARECAAYFAkhZZR0ACgkQ9CaO5/Lv0PCbTQCgm/eWN4Xphx9GQ4CTPZXNXdvn
rigAlA5l2731npViTS8ofT4cqSi5F6o=
=g6gT
-END PGP SIGNATURE-




Re: Character encoding

2008-06-18 Thread nch

Chris, thanks for your help.
Please, see my comments below.
Kind regards.



- Original Message 
From: Christopher Schultz [EMAIL PROTECTED]
To: Tomcat Users List users@tomcat.apache.org
Sent: Wednesday, June 18, 2008 9:42:21 PM
Subject: Re: Character encoding

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

nch,

nch wrote:
| | - I do remote debugging through Eclipse to both tomcat on windows
| | (same machine as eclipse, though) and tomcat on debian.

| Okay, remote debugging should not affect the server, but I'm still
| wondering if the server.xml you think you are using is the one actually
| being used. Try setting the Connector port to something crazy like
| 12345 and restarting. If you can still contact the server, then you are
| either editing the wrong server.xml (there should only be one!) or your
| changes are not being picked up.

I'll try.

| | - When I send piraña it is always encoded into the URL as
| | pira%C3%B1a, whether running tomcat on windows, debian or even
| | running my app into Jetty.

| That's because your browser is encoding it, not the server. So, it
| doesn't depend on the server configuration (except possibly for the page
| encoding, which often directs the browser to use utf-8 URI encoding).

But if the URL is always encoded in the same way and tomcat does not receive 
any other information on what the resulting character encoding should be, why 
do I get different values from tomcat?

| | - If I type piraña on http://www.us-webmasters.com/Decode-URLs/ and
| | switch browser encoding display between ISO-8859-1 and UTF-8, I can
| | see that when ISO-8859-1, then it displays piraÃ±a, when UTF-8, it
| | displays piraña.

| I'm not sure what you think you're doing, there. When I paste that word
| into the box to decode, I get broken output. There is no indication as
| to what encoding the server expects for URIs.

| Switching browser interpretation of the resulting page does not seem to
| prove anything. The server never advertises any encoding to use, so the
| browser just chooses whatever it wants. My browser chooses ISO-8859-1.
| When I switch it to UTF-8, I see the expected interpretation. I'm not
| sure what I just learned.

If we take a look at this page's source code we can see the following line:
 <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
I assume that this site expects ISO-8859-1 from the browser and so it decodes it 
as ISO-8859-1.
In the case of piraña it decodes to piraÃ±a, which is the same as what tomcat 
gives to my controller, even though I'm explicitly telling it to decode as 
UTF-8.

| | - Something could be wrong in my debian environment. How can I find
| | out which env. variables tomcat is using?

| Tomcat does not use any environment variables. The only settings that
| affect the interpretation of the URI are the URIEncoding and
| useBody... settings on the Connector. Are you using more than one
| connector? Are you using Apache httpd out in front of Tomcat?

Ah, I forgot to mention. I do have an apache httpd in front of tomcat, but for 
testing purposes I'm directly accessing tomcat through port 8080. Anyway, it 
yields same results whether directly accessing tomcat or through httpd.
So, if tomcat doesn't read env. variables, why would debian packagers try to 
set LANG to system default in their tomcat init script? Does that make sense?
BTW, the instance of tomcat I'm running on debian was manually downloaded from 
tomcat.apache.org

| | - If I try to manually decode the returned parameter into my
| | controller by using URLDecoder.decode(query, "UTF-8") then I can see
| | no difference. That is, when debugging the tomcat on windows the
| | result is piraña while debugging the one on debian the result is
| | piraÃ±a.

| So, running this:

| URLDecoder.decode(URLEncoder.encode("piraña", "UTF-8"), "UTF-8");
|
| ...gives you piraÃ±a on your debian system? That doesn't seem right.

I realise this test is crap :-) because I'm passing URLEncoder.encode an 
already decoded parameter. I'm tired ...
I'll try to get the raw url parameter.

| | - Is URLDecoder#decode environment dependent?

| Nope. As long as you always provide the encoding to be used, you should
| be fine.

- -chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEUEARECAAYFAkhZZR0ACgkQ9CaO5/Lv0PCbTQCgm/eWN4Xphx9GQ4CTPZXNXdvn
rigAlA5l2731npViTS8ofT4cqSi5F6o=
=g6gT
-END PGP SIGNATURE-



  

Re: Character encoding

2008-06-18 Thread nch
You say:
Tomcat does not use any environment variables. The only settings that
affect the interpretation of the URI are the URIEncoding and
useBody... settings on the Connector. Are you using more than one
connector? Are you using Apache httpd out in front of Tomcat?

Perhaps the JVM does and so tomcat reads them indirectly through it??

Cheers




  

Re: Character encoding

2008-06-18 Thread Christopher Schultz

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

nch,

nch wrote:
| But, if the URL is always encoded in the same way and tomcat does
| not receive any other information on what the resulting character
| encoding should be. Why do I get different values from tomcat?

Because the servers are configured differently (probably in some very
small way). The problem is that the HTTP spec is ... hazy when it comes
to how URIs should be interpreted. The spec says that most servers
expect ISO-8859-1, but many clients are (rightfully so, IMO) switching
to UTF-8. This leaves us developers in a limbo where we have to beat our
servers into submission and cross our fingers when decoding URIs.

| | Tomcat does not use any environment variables. The only settings that
| | affect the interpretation of the URI are the URIEncoding and
| | useBody... settings on the Connector. Are you using more than one
| | connector? Are you using Apache httpd out in front of Tomcat?
|
| Ah, I forgot to mention. I do have an apache httpd in front of
| tomcat, but for testing purposes I'm directly accessing tomcat through
| port 8080. Anyway, it yields same results whether directly accessing
| tomcat or through httpd.

If you have multiple Connectors (one for AJP and one for HTTP), are
you setting the URIEncoding="utf-8" on both of them, or only one of
them? It would help if you posted your entire server.xml.

| So, if tomcat doesn't read env. variables, why would debian packagers
| try to set LANG to system default into their tomcat init script?

Probably to make it more consistent with the rest of the packages they
support. They want you to be able to set LANG=foo and have it change
everything for all services.

| Does that make sense?

I think it /does/ make sense, but it often confuses the issue when
you're dealing with someone who is NOT using, say, debian.

Note that there are no external factors for URI decoding. The only
setting that can change it is the URIEncoding attribute of the
Connector. It does not fall back to the system Locale's preferred
encoding or file.encoding or anything weird like that. It /always/
falls back to ISO-8859-1, regardless of any other settings.
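
If the Connector setting cannot be changed, a common workaround is to undo the
ISO-8859-1 decoding in application code. The sketch below assumes the
container really did decode the UTF-8 bytes as ISO-8859-1; the class and method
names (Latin1Repair, latin1ToUtf8) are hypothetical. It is not a substitute
for configuring URIEncoding, since applying it to a correctly decoded value
would corrupt it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical application-side workaround, not a Tomcat API.
final class Latin1Repair {

    // Recovers the original bytes by re-encoding as ISO-8859-1 (a lossless
    // char-to-byte mapping for U+0000..U+00FF), then decodes them as UTF-8.
    static String latin1ToUtf8(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+00B1 is how the UTF-8 bytes C3 B1 appear after a
        // Latin-1 decode; the repair turns them back into U+00F1.
        System.out.println(latin1ToUtf8("pira\u00C3\u00B1a"));
    }
}
```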

| BTW, the instance of tomcat I'm running on debian was manually
| downloaded from tomcat.apache.org

The only reason it would be an issue is if the configuration was not
what you expected it to be (for instance, the server.xml you are editing
is not the one that TC is actually using).

- -chris

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhZeP8ACgkQ9CaO5/Lv0PA3ngCeMSw/ltgABrIKpVsqb+HEqAa9
KP0Aniac1roIDr0rPBl098vfGxlnVf7p
=RGzQ
-END PGP SIGNATURE-




Re: Character encoding

2008-06-18 Thread Christopher Schultz

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

nch,

nch wrote:
| You say:
| Tomcat does not use any environment variables. The only settings that
| affect the interpretation of the URI are the URIEncoding and
| useBody... settings on the Connector. Are you using more than one
| connector? Are you using Apache httpd out in front of Tomcat?
|
| Perhaps the JVM does and so tomcat reads them indirectly through it??

You can read the code for the connector. Those settings are the only
relevant ones.

- -chris
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhZek0ACgkQ9CaO5/Lv0PBDvQCguIgu+QMTjKDxua3CS0cn9Gd0
AEoAoIZTNaJpiI8Xv3szp9O+3eANIGK0
=+VmT
-END PGP SIGNATURE-




Re: [OT] Re: Character encoding

2007-07-09 Thread Christopher Schultz
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

lightbulb,

lightbulb432 wrote:
>> POST requests always use the request's body encoding, which is
>> specified in the HTTP header (and can be overridden by using
>> request.setCharacterEncoding). Some broken clients don't provide
>> the character encoding of the request, which makes things difficult
>> sometimes.
>
> What determines what's specified in the HTTP header for the value of the
> encoding?

Well... it's a bit of a chicken-and-egg scenario, since the encoding
specified in the header must match the encoding actually used in the
request. So, you could either decide that the header should match the
content or the content should match the header.

> Is it purely up to the user agent, or can Tomcat provide hints
> based on previous requests how to encode it - or is it something up to the
> end user to set in their browser (in IE, View - Encoding)?

Typically, the default encoding used by the user-agent will be
locale-specific. For instance, most browsers in the US will use
ISO-8859-1 as the default encoding, or maybe WINDOWS-1252 if you're
unlucky. Ideally, the server should be able to accept all reasonable
encodings. The Accept-Charset header sent by the user-agent to the
server indicates the acceptable encodings that should be returned, rated
by acceptability. For instance, my en_US Mozilla Firefox on Windows
sends this Accept-Charset string to servers:

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

This indicates that the browser would prefer ISO-8859-1 encoding, but
will also accept UTF-8 as a second choice, but that anything will do
('*') if those two are unavailable.
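
To make the ordering concrete, here is a minimal sketch that ranks the entries
of an Accept-Charset value by their q parameter. The class and method names
(AcceptCharsetSketch, preferred) are invented for illustration; a servlet
container does not expose such a helper, and the sketch does not handle
malformed q-values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical parser for an Accept-Charset header value.
final class AcceptCharsetSketch {

    // One header entry: a charset name (or "*") and its q-value.
    private static final class Entry {
        final String name;
        final double q;

        Entry(String name, double q) {
            this.name = name;
            this.q = q;
        }
    }

    // Returns charset names ordered from most to least preferred.
    // Entries with equal q keep their header order (List.sort is stable).
    static List<String> preferred(String header) {
        List<Entry> entries = new ArrayList<Entry>();
        for (String item : header.split(",")) {
            String[] parts = item.trim().split(";");
            String name = parts[0].trim();
            if (name.isEmpty()) {
                continue;
            }
            double q = 1.0; // q defaults to 1 when it is omitted
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2).trim());
                }
            }
            entries.add(new Entry(name, q));
        }
        entries.sort((a, b) -> Double.compare(b.q, a.q));
        List<String> names = new ArrayList<String>();
        for (Entry e : entries) {
            names.add(e.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints [ISO-8859-1, utf-8, *]
        System.out.println(preferred("ISO-8859-1,utf-8;q=0.7,*;q=0.7"));
    }
}
```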

On HTML form elements, you may override the encoding used to send the
data:

<form accept-charset="UTF-8">

The HTML 4 specification says this about the accept-charset attribute:
"The default value for this attribute is the reserved string "UNKNOWN".
User agents may interpret this value as the character encoding that was
used to transmit the document containing this FORM element."
(http://www.w3.org/TR/html4/interact/forms.html#h-17.3)

So, if the server sends a document using UTF-8, it is polite for the
user-agent to use that same encoding to respond to the server if the
server hasn't indicated any preference by using the accept-charset
form attribute.

> In what cases would you call request.setCharacterEncoding to override the
> value specified by the user agent?

You should only do this when the user-agent does not declare the charset
being used in the body of the request through the Content-Type request
header. You should also only do this when you are relatively confident
that the user-agent is sending the data in the overridden character set.

For instance, if you suspect that most browsers adhere to the W3C's
recommendation above that an UNKNOWN accept-charset implies that the
browser should respond to the server with the same charset as used in
the previous server response (got all that?), and you always use the
same charset to send pages (say, UTF-8), then it is reasonable to
override any unspecified Content-Type encoding with the charset you use
to send pages (UTF-8, in this case).

The HTTP specification has this to say about missing charsets (in
Content-Type headers):
   The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value. See
   section 3.4.1 for compatibility problems.
(http://www.ietf.org/rfc/rfc2616.txt Section 3.7.1)

Basically, this says that a missing charset within a Content-Type header
means that the request should be interpreted as being encoded using
ISO-8859-1 encoding. Pretty simple.
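
As a rough illustration of that rule, the sketch below pulls the charset
parameter out of a Content-Type value and falls back to ISO-8859-1 when none
is present. The names (ContentTypeCharset, charsetOf) are hypothetical, and
the sketch simplifies the spec by applying the default to every media type,
whereas RFC 2616 states it for "text" subtypes.

```java
// Hypothetical helper; a real servlet would normally rely on
// request.getCharacterEncoding() instead of parsing the header by hand.
final class ContentTypeCharset {

    private static final String PREFIX = "charset=";

    // Returns the declared charset, or "ISO-8859-1" if none is declared.
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String param = part.trim();
                // Parameter names are case-insensitive.
                if (param.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                    String value = param.substring(PREFIX.length()).trim();
                    // Strip optional surrounding double quotes.
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    if (!value.isEmpty()) {
                        return value;
                    }
                }
            }
        }
        return "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```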

> Shouldn't you trust the user agent rather
> than trying to guess? (Or is this only used in cases where the user agent is
> broken, like you said - but then how would you know you're dealing with a
> broken client to begin with...aah, complicated!)

You should /always/ respect the charset sent by the client. In fact, the
HTTP spec says so:
HTTP/1.1 recipients MUST respect the charset label provided by the sender;
(http://www.ietf.org/rfc/rfc2616.txt Section 3.4.1)

If the client sends the wrong charset, it's their fault that their data
will get all screwed up.

But, if there's no charset, then you should provide your own. The
default charset should be ISO-8859-1. I think Tomcat uses the default
encoding of the JVM if no charset is provided, which is a problem for
folks who set the JVM encoding to UTF-8 for i18n purposes... because
then the default becomes UTF-8 which is incorrect. Fortunately, UTF-8
and ISO-8859-1 are compatible for most common lower ASCII characters.
This has led to a lot of folks thinking that they have their servers
configured 

[OT] Re: Character encoding

2007-07-08 Thread lightbulb432

That was a really great set of answers, thanks! These follow-ups are somewhat
off-topic to Tomcat, but you really know this stuff well so I hope you don't
mind addressing them:


> POST requests always use the request's body encoding, which is specified
> in the HTTP header (and can be overridden by using
> request.setCharacterEncoding). Some broken clients don't provide the
> character encoding of the request, which makes things difficult sometimes.

What determines what's specified in the HTTP header for the value of the
encoding? Is it purely up to the user agent, or can Tomcat provide hints
based on previous requests how to encode it - or is it something up to the
end user to set in their browser (in IE, View - Encoding)?

In what cases would you call request.setCharacterEncoding to override the
value specified by the user agent? Shouldn't you trust the user agent rather
than trying to guess? (Or is this only used in cases where the user agent is
broken, like you said - but then how would you know you're dealing with a
broken client to begin with...aah, complicated!)



You shouldn't have to worry about cookie encoding, since you can always
 call request.getCookies() and get them correctly interpreted for you.

What do you mean by this? Does it mean (pardon the surely messed up use of
the API below) in your response.addCookie(), you add a cookie where the
value has cookie.setValue(new String(charByteArray, "UTF-8")) then you read
it back using responseCookie.getValue().getBytes("UTF-8")? (Where UTF-8 is
whatever encoding you're using internally in your application.)


Finally, what's the default encoding used by the response when
response.setCharacterEncoding(myEncoding) isn't called? Am I correct to
assume that if that default is not the default Java String encoding of
UTF-16, then you MUST convert all the Strings you've output to that
encoding? (...because the HTTP header expects whatever the default is, but
Java is outputting UTF-16 encoded text to the actual response bytes)

Am I speaking rubbish here, or am I thinking about these concepts in the
right way?

Thanks a lot.

P.S. How did you learn all of that?!





Re: Character encoding

2007-07-07 Thread Christopher Schultz
Lightbulb,

lightbulb432 wrote:
 Why is the URIEncoding attribute specified on the connector rather than on a
 host, for example?

Because the host doesn't handle connections... the connectors do.

 Does this mean that the number of virtual hosts that can
 listen on the same port on the same box are limited by whether they all use
 the same encodings in their URIs?

Yes, all virtual hosts listening on the same port will have to have the
same encoding. Fortunately, UTF-8 works for all languages that I know of.

 Now that I think about it, wouldn't it be
 at the context level, not even at the host level?

If you had a connector-per-context, yes, but that's not the case.

 In Tomcat 6, should the useBodyEncodingForURI be used if not needing
 compatibility with 4.1, as the documentation mentions? 

I would highly recommend following that recommendation.

 To see if I have things straight, is HttpServletRequest's
 get/setCharacterEncoding used for both the request parameters from a GET
 request AND the contents of the POST?

No. GET requests have request parameters encoded as part of the URL,
which is affected by the Connector's URIEncoding parameter. POST
requests always use the request's body encoding, which is specified in
the HTTP header (and can be overridden by using
request.setCharacterEncoding). Some broken clients don't provide the
character encoding of the request, which makes things difficult sometimes.
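
To illustrate why the decoding charset matters for URL parameters, here is a small JDK-only sketch. It does not use Tomcat's connector code; java.net.URLDecoder merely stands in for whatever the connector does with URIEncoding, and the class name is made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical demo class; URLDecoder stands in for the connector's URI decoding. */
final class QueryDecodingDemo {

    /** Percent-decodes a query-string value using the named charset. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }
}
```

A browser that sends UTF-8 encodes the first letters of "Árvíz" as %C3%81rv%C3%ADz. Decoding that with "UTF-8" gives back the original text, while decoding the same bytes with "ISO-8859-1" turns every two-byte sequence into two unrelated characters.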

 How are multipart POST requests dealt with?

Typically, each part of a multipart request contains its own character
encoding, so a multipart POST would follow the encoding for the part
you're reading at the time.

 And HttpServletResponse's get/setCharacterEncoding is used for the contents
 of the response header and the meta tags?

Only for the header field, not META tags. If you want to emit META tags,
you'll have to do them yourself.

 Does it also encode the page content itself? 

Nope. If you change the character encoding for a response after the
response has already had some data written to it, I think you'll send an
incorrect header. For instance:

response.setCharacterEncoding("ISO-8859-1");
PrintWriter out = response.getWriter();

response.setCharacterEncoding("Big5");

out.print("abcdef");
out.flush();

Your client will not receive a sane response. Setting the character
encoding only sets the HTTP response header and configures the
response's Writer, if used, but only /before/ calling getWriter the
first time.
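
The "only before getWriter" rule can be sketched with plain JDK classes. SketchResponse below is a hypothetical stand-in, not the Servlet API: it fixes the writer's charset the first time getWriter() is called and ignores later changes.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a response object; not the Servlet API. */
final class SketchResponse {
    private final ByteArrayOutputStream body = new ByteArrayOutputStream();
    private Charset charset = StandardCharsets.ISO_8859_1;
    private PrintWriter writer;

    /** Changes the declared charset, but only while no writer exists yet. */
    void setCharacterEncoding(Charset cs) {
        if (writer == null) {
            charset = cs;
        }
    }

    Charset getCharacterEncoding() {
        return charset;
    }

    /** The writer's charset is fixed the first time this is called. */
    PrintWriter getWriter() {
        if (writer == null) {
            writer = new PrintWriter(new OutputStreamWriter(body, charset));
        }
        return writer;
    }

    /** Flushes the writer, if any, and returns the bytes written so far. */
    byte[] bodyBytes() {
        if (writer != null) {
            writer.flush();
        }
        return body.toByteArray();
    }
}
```

With this sketch, calling setCharacterEncoding a second time after getWriter() changes neither the reported charset nor the bytes that end up in the body.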

 What about the encoding of cookies for both incoming requests and outgoing
 responses?

See the HTTP spec, section 4.2 (Message Headers). It references RFC
822 (ARPA Internet text messages) which does not actually specify a
character encoding. From what I can see, low ASCII is the encoding used.
You shouldn't have to worry about cookie encoding, since you can always
call request.getCookies() and get them correctly interpreted for you.
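
If an application does want non-ASCII text in a cookie, one common approach (an assumption here, not something the HTTP spec or this thread prescribes) is to percent-encode the value so that only ASCII goes over the wire. A JDK-only sketch with a hypothetical class name:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical helper that keeps cookie values within ASCII. */
final class CookieValueCodec {

    /** Percent-encodes the raw text as UTF-8 so the result is plain ASCII. */
    static String toCookieValue(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    /** Reverses toCookieValue(). */
    static String fromCookieValue(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The encoded string would be passed to the cookie's setValue and decoded again after reading it back with getValue.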

-chris





Re: Character encoding

2006-12-18 Thread Mester József
Hello Mark

Mester József wrote:
 Ok. Let's see my problem. 
 I have a form with text input box. I type Árvíztűrő tükörfúrógép and I get  
 ÃrvíztűrÅ tükörfúrógép

I have tested this with the latest 5.5.x source and it works correctly
(there haven't been any encoding related fixes since 5.5.20). Have you
got the request dumper valve enabled? This causes all request
parameters to be processed as ISO-8859-1 and enabling it was the only
way I could replicate the behaviour you see.

I develop with NetBeans 5.5 and my servlet container is the Tomcat bundled
with NetBeans (5.5.17). I haven't changed anything in Tomcat's settings.
What is the request dumper valve? And where can I set it?

If you haven't got this valve enabled, check your application for
filters, valves etc that may read request parameters before your
request.setCharacterEncoding("UTF-8") is called. Note parameters are
only read once so if the encoding is wrong then you can't easily fix it.

There are no filters in my application.

Joe












Re: Character encoding

2006-12-16 Thread Mark Thomas
Mester József wrote:
 Hello Mark

 Ok. Let's see my problem. 
 I have a form with text input box. I type Árvíztűrő tükörfúrógép and I get  
 ÃrvíztűrÅ tükörfúrógép
 
 Beautiful isn't it? 

I have tested this with the latest 5.5.x source and it works correctly
(there haven't been any encoding related fixes since 5.5.20). Have you
got the request dumper valve enabled? This causes all request
parameters to be processed as ISO-8859-1 and enabling it was the only
way I could replicate the behaviour you see.

If you haven't got this valve enabled, check your application for
filters, valves etc that may read request parameters before your
request.setCharacterEncoding("UTF-8") is called. Note parameters are
only read once so if the encoding is wrong then you can't easily fix it.
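
The garbled text reported above is what you get when UTF-8 bytes are read as ISO-8859-1. A JDK-only sketch that reproduces it; the class and method names are made up for illustration, and the repair step is a last-resort workaround that only applies to this specific misreading, not a substitute for setting the encoding before parameters are parsed.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of UTF-8 text being misread as ISO-8859-1. */
final class MojibakeDemo {

    /** Encodes the text as UTF-8, then wrongly decodes the bytes as ISO-8859-1. */
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /**
     * Reverses misread(). This is lossless because Java's ISO-8859-1
     * maps every byte value to exactly one character and back.
     */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

Each accented letter of the input occupies two bytes in UTF-8, so it shows up as two characters after the misread, which matches the doubled characters in the report.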

Mark

-
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Character encoding

2006-12-15 Thread Mester József
Hello Mark

This is unlikely to help you and may be read-only on your JVM.

You don't say what doesn't work but generally the following is required:
set URIEncoding="UTF-8" on the connector
set the correct response encoding on every response (you can do
this per page or use a filter to do this for all pages)
Ok. Let's see my problem. 
I have a form with text input box. I type Árvíztűrő tükörfúrógép and I get  
ÃrvíztűrÅ tükörfúrógép


Beautiful isn't it? 
The page is:

<%@page contentType="text/html"%>
<%@page pageEncoding="UTF-8"%>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Try encoding</title>
</head>
<body>
<% try {
request.setCharacterEncoding("UTF-8");
}
catch (Exception ex) {
out.println("Bad something: " + ex.getMessage());
} %>
Hello
<%=request.getParameter("nev")%>
<br>
<form accept-charset="UTF-8" action="index.jsp" method="POST">
<input type="text" name="nev">
<input type="submit" value="Send name">
</form>

</body>
</html>



If you use a database make sure that you persist your data in the
correct encoding.
If my text came from database everything is correct.

If you convert from bytes to characters or characters to bytes make
sure you use the correct encoding.
I don't.



Joe







Re: Character encoding

2006-12-12 Thread olivier nouguier

export CATALINA_OPTS=-Dfile.encoding=UTF-8

On 12/12/06, Mester József [EMAIL PROTECTED] wrote:


Hi
I have some problem with character encoding. I have found a page (
http://junlu.com/msg/1132.html ) and on this page there is a direction:

2.
In the Catalina.bat (windows) catalina.sh (linux) there must be a switch
added to the call to java.exe.  The
switch is:
-Dfile.encoding=UTF-8

But I don't know where I can add this switch in catalina.sh

I use Tomcat 5.5.20 on Debian Sarge

Joe









--
Remember that at the moment of your birth everyone was joyful
and you were in tears.
Live so that at the moment of your death everyone is in tears
and you are joyful.


Re: Character encoding, once again....

2006-08-14 Thread dizzi
THANKS, the URIEncoding property of the HTTP connector was the source of
the GET problems.


I tried removing the filter after that, but POST requests stopped working,
so I've installed the filter again.


Now both GET and POST work. Hallelujah...

d.

Anyway its

On Mon, 14 Aug 2006 22:40:27 +0200, Mark Thomas [EMAIL PROTECTED] wrote:


dizzi wrote:

Im not sure if this is problem of tomcat, but i think that its most
probable.


Unlikely. I haven't seen a valid bug in this area for quite some time.
It is usually a combination of configuration (check the URIEncoding
property of your connector) and application errors. For a correctly
coded application, the content-encoding filter should be unnecessary.

I'd start with a simple application like this one and build up to the
form that is causing problems.

http://marc.theaimsgroup.com/?l=tomcat-user&m=111548442910292&w=2

Mark





Re: Character Encoding : Unix vs Windows

2006-04-03 Thread Michael Jouravlev
On 4/3/06, Nigel Blake [EMAIL PROTECTED] wrote:
 Problem : Creating a URL type with parameters that have a space
 between them causes an IOException in a javabean when called from
 Tomcat 5.0.0.27 on a Unix installation. Using the same bean and JSP
 code causes no problem when invoked on the same version of Tomcat on a
 Windows installation.

 Solutions tried :

 1.Ensured that the server connector encoding is UTF-8 (suggested in the FAQ)
 2. Have ensured that jsp the page instruction is UTF-8
 3. I could turn the bean into a servlet and try using the
 setContentType or SetCharacterEncoding. ( I would rather not )

 Any suggestions that would make Unix implementation work would be
 gratefully received. I have run out of ideas...

URLEncoder.encode(), URLEncoder.decode()
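
A minimal sketch of that suggestion, applied to the search URL from the original post. The class and method names are hypothetical; the point is only that the parameter value is encoded before it is appended, so the space never reaches the URL as a raw character.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the search URL with an encoded keyword. */
final class SearchUrlBuilder {

    private static final String BASE = "http://orientalbirdimages.org/search.php?keyword=";

    /** Returns the search URL with the keyword form-encoded as UTF-8. */
    static String searchUrl(String keyword) {
        try {
            // Only the parameter value is encoded, never the whole URL
            return BASE + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }
}
```

The returned string can then be handed to new URL(...) in place of the hand-written one; for the keyword "black bittern" the space is sent as "+".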




RE: Character Encoding : Unix vs Windows

2006-04-03 Thread Derrick Koes
java.net.URLEncoder.encode 

-Original Message-
From: Nigel Blake [mailto:[EMAIL PROTECTED] 
Sent: Monday, April 03, 2006 5:43 PM
To: users@tomcat.apache.org
Subject: Character Encoding : Unix vs Windows

Problem : Creating a URL type with parameters that have a space between
them causes an IOException in a javabean when called from Tomcat
5.0.0.27 on a Unix installation. Using the same bean and JSP code causes
no problem when invoked on the same version of Tomcat on a Windows
installation.

Solutions tried :

1. Ensured that the server connector encoding is UTF-8 (suggested in the
   FAQ)
2. Have ensured that jsp the page instruction is UTF-8
3. I could turn the bean into a servlet and try using the setContentType or
   SetCharacterEncoding. ( I would rather not )

Any suggestions that would make Unix implementation work would be
gratefully received. I have run out of ideas...

Thanks Nigel


Example code :

URL birdSite = new
URL("http://orientalbirdimages.org/search.php?keyword=black bittern");

try {

  webPageStream = new BufferedReader(new InputStreamReader(birdSite.
  openStream()));
}
catch (MalformedURLException ne) {
  System.out.println(
  "Malformed URL Error called from within getPageNumber() " +
  ne.toString());
}
catch (IOException ie) {
  System.out.println("IOException called from within getPageNumber "
  + ie.toString());
}



The IOException is caught under Unix when the variable I pass to the URL
query string has a query parameter of more than 1 word, as in
'black bittern' above.




RE: Character Encoding -ISo-8859-1 Vs UTF-8 Vs GBK

2005-10-19 Thread birendar . waldiya
Notice: The information contained in this e-mail message and/or attachments to 
it may contain confidential or privileged information.   If you are not the 
intended recipient, any dissemination, use, review, distribution, printing or 
copying of the information contained in this e-mail message and/or attachments 
to it are strictly prohibited.   If you have received this communication in 
error, please notify us by reply e-mail or telephone and immediately and 
permanently delete the message and any attachments.  Thank you

RE: Character Encoding -ISo-8859-1 Vs UTF-8 Vs GBK

2005-10-19 Thread LORESERVO.COM
Please don't send more emails, I'm not a Tomcat user

-Mensaje original-
De: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Enviado el: Miércoles, 19 de Octubre de 2005 04:20 a.m.
Para: Tomcat Users List
Asunto: RE: Character Encoding -ISo-8859-1 Vs UTF-8 Vs GBK




Re: Character Encoding -ISo-8859-1 Vs UTF-8 Vs GBK

2005-10-18 Thread afonseca
Hi,

In Europe we have lots of languages. I don't think it's true that UTF-8 can
handle European characters very well. There is a list on the net (I don't know
where) of the other ISO encodings for other languages.

AF

Citando David Delbecq [EMAIL PROTECTED]:

 Hi,
 
 UTF-8 can handle European and Chinese characters very well.
 If you can't read any of those using UTF-8, this simply
 means your text file is not saved in UTF-8.
 
 [EMAIL PROTECTED] a écrit :
 
 Hi,
 I am trying to read universal characters from a text file into my Java
 application, which stores them in a database. When I use encoding GBK I
 can read all special characters in Chinese; when I use encoding ISO-8859-1
 I can read Latin but not Chinese; but when I use encoding UTF-8, where I
 think I am supposed to read both Chinese and Latin correctly, I am not
 able to read either of them. Can anyone give me pointers to a solution?
 Further, the beta-like character is converted to ss in Latin-1.
 
 thanks in advance
 Birendar S Waldiya
 
 
 
 





Re: Character Encoding -ISo-8859-1 Vs UTF-8 Vs GBK

2005-10-18 Thread David Delbecq
UTF-8 (8-bit Unicode Transformation Format) is a lossless,
variable-length character encoding for Unicode created by Ken Thompson
and Rob Pike. It uses groups of bytes to represent the Unicode standard
for the alphabets of many of the world's languages. UTF-8 is especially
useful for transmission over 8-bit Electronic Mail systems.
http://en.wikipedia.org/wiki/UTF-8

In computing, Unicode provides an international standard which has the
goal of providing the means to encode the text of every document people
want to store on computers. This includes all scripts in active use
today, many scripts known only by scholars, and symbols which do not
strictly represent scripts, like mathematical, linguistic and APL symbols.
http://en.wikipedia.org/wiki/Unicode
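
Because UTF-8 can encode every Unicode character, Latin and Chinese text both survive as long as the file is written and read with the same charset. A small JDK-only sketch (hypothetical class name, temporary file created just for the demonstration):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo: write text with one charset, read it back with another. */
final class TextFileCharsets {

    /** Writes the text to a temporary file as writeAs and reads it back as readAs. */
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(file, text.getBytes(writeAs));
                // Always name the charset explicitly instead of relying on the platform default
                return new String(Files.readAllBytes(file), readAs);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing and reading with UTF-8 returns the original string unchanged; reading the same file as ISO-8859-1 does not, which is the symptom to expect when a file's actual encoding and the charset used to read it differ.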


[EMAIL PROTECTED] a écrit :

Hi,

In Europe we have lots of languages. I don't think it's true that UTF-8 can
handle ALL European characters very well. There is a list on the net (I don't
know where) of the other ISO encodings for other languages.

AF

Citando David Delbecq [EMAIL PROTECTED]:

  

Hi,

UTF-8 can handle European and Chinese characters very well.
If you can't read any of those using UTF-8, this simply
means your text file is not saved in UTF-8.

[EMAIL PROTECTED] a écrit :



Hi,
I am trying to read universal characters from a text file into my Java
application, which stores them in a database. When I use encoding GBK I
can read all special characters in Chinese; when I use encoding ISO-8859-1
I can read Latin but not Chinese; but when I use encoding UTF-8, where I
think I am supposed to read both Chinese and Latin correctly, I am not
able to read either of them. Can anyone give me pointers to a solution?
Further, the beta-like character is converted to ss in Latin-1.

thanks in advance
Birendar S Waldiya








  






RE: Character Encoding -ISo-8859-1 Vs UTF-8 Vs GBK

2005-10-18 Thread afonseca
Sorry, my mistake! I thought we were speaking about something else...

AF

Citando Peter Crowther [EMAIL PROTECTED]:

  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
  I don't think it's true
  that UTF-8 can handle ALL european character very well.
 
 If it can't, the Unicode consortium (http://www.unicode.org/) will be
 pretty worried, as UTF-8 is an encoding of Unicode...
 
   - Peter
 
 
 

