Tuomo L wrote:
Ok, now I'm really confused.
In Bruno's excellent paper about Cocoon encoding, there's a section that says:
"For Java-insiders: what Cocoon actually does internally is apply the following trick to get a parameter correctly decoded: suppose "value" is a string containing a request parameter, then Cocoon will do:
value = new String(value.getBytes("ISO-8859-1"), "UTF-8"); "
correct.
this trick is the re-en-decoding
we get a string from getParameter, we encode it to bytes with ISO-8859-1 and decode from there with UTF-8
why? to correct the container's mistake
the container will have received bytes (let's call these the original-request-parameter-bytes) but will have applied its 'container-encoding' to them in order to return a String from getParameter.
(NOTE: this container encoding is a property of your chosen container and is typically fixed to iso-8859-1; unless you are running jetty with the mentioned charset property set, you should never change it)
now, cocoon knows from the form-encoding in which encoding forms have been serialized out, and thus how request params will be *really* encoded
so to correct the error the container made we encode back to the original bytes using latin-1 and then apply the correct form-encoding (utf-8)
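The round trip described above can be sketched in plain Java. This is only an illustration, not Cocoon's actual code: the class and method names (ReDecodeDemo, containerDecode, fix) are made up for the example, and the form-encoding is assumed to be UTF-8 as in the quoted paper.

```java
import java.nio.charset.StandardCharsets;

public class ReDecodeDemo {

    // Stand-in for the container: decodes the raw parameter bytes as ISO-8859-1.
    public static String containerDecode(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.ISO_8859_1);
    }

    // The correction: recover the original bytes via ISO-8859-1,
    // then decode them with the real form-encoding (assumed UTF-8 here).
    public static String fix(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e4\u00f6"; // a-umlaut, o-umlaut
        byte[] onTheWire = original.getBytes(StandardCharsets.UTF_8); // C3 A4 C3 B6

        String garbled = containerDecode(onTheWire); // four Latin-1 characters
        String repaired = fix(garbled);

        System.out.println(garbled.length());          // 4
        System.out.println(repaired.equals(original)); // true
    }
}
```

The trick only works because ISO-8859-1 maps every byte value to exactly one character, so encoding back to bytes loses nothing.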
between servlet-spec 2.2 and 2.3 this issue occurred to the peeps doing the spec, so they added setCharacterEncoding() to the servlet request and mention explicitly that you need to call it before any getParameter call (or any related action that requires parsing, and thus decoding, the query string)
But then in the bug report for Xalan (someone having this same problem) it says:
"According to section 16.2 of the XSLT Recommendation [1], non-ASCII characters in URI attribute values should be escaped using the method recommended in Section B.2.1 of the HTML 4.0 Recommendation [2]. The latter recommends that non-ASCII characters be represented in UTF-8 prior to applying the "%HH" escaping described by the URI RTF, regardless of the output encoding."
nifty, didn't know... so whatever output encoding you set the uri's will be utf-8 encoded, and then url-encoded?
haven't ever seen this, I was under the impression that to xalan attributes were just attributes and would have expected characters to be replaced by character-entity-refs depending on if they are supported or not by the applied output-encoding
This is what Xalan does (HTML serialization), so it obeys the spec.
Correct me if I'm wrong, but during serialization if there are special characters (above 128) in a URL's request parameters (href-attributes etc.), they are first encoded in UTF-8 by Xalan. Even if the browser
apparently, would like to see some test evidence to be on the safe side though
detects the page as ISO-8859-1 or anything else, these URLs in the HTML source contain parameters in UTF-8. Now, when user clicks on this link,
but it is not about request-parameters is it? it is about the proper URL part, no?
as in:
http://server:port/path/more-path?request-param=value
---------------------------------|-------------------
  >> area-not-fixed-by-cocoon << | >> area fixed by cocoon <<
(in fact I'm even doubting whether we are fixing the names of the request-params (actually my guess would be we're only doing the values))
see http://cvs.apache.org/viewcvs.cgi/cocoon/trunk/src/java/org/apache/cocoon/environment/http/HttpRequest.java?rev=55600&root=Apache-SVN&view=auto
there is the internal decode() method. it only gets called from areas that deal with request-parameter-values (as I started to think: not even the names)
Cocoon reads the request parameters in as ISO-8859-1, and converts them to UTF-8, without knowing that these parameters were already UTF-8!
nope, don't think so... first nuance (see above) the container reads and applies (typically) ISO-8859-1,...
and cocoon correctly re-encodes request-parameter-values based on its 'form-encoding', but isn't (at least to my knowledge) touching the url part of things
(sorry for the confusion but that exactly was the executive summary from my previous post)
hope this clarifies the issue
hope this strengthens your trust in the proposed workarounds...
-marc=
My knowledge of the Cocoon internals is not very good, but could this be the problem?
-Tuomo
On Fri, 29 Oct 2004, Marc Portier wrote:
just scanning through this issue fast it seems to me like more evidence of things expressed here: http://marc.theaimsgroup.com/?t=109231177100007&r=1&w=2
rehashing what I read from Tuomo's setup:
- cocoon-servlet init params are set to have container-encoding unchanged (thus iso_8859_1) like we recommend and form-encoding to utf-8 to make sure his forms can support wide variety of characters
- as a consequence of this last setting (and the well-known browser limitation) we need to sync the encoding on the serializer to this same utf-8
- because of this setting there is no reason to complain about the resulting HTML, that is full of utf-8 encoding, no need to refer to specs or blame cocoon: xml serialization was requested to use utf-8 so it does (even xalan does its work here I suppose)
now, what goes wrong?
well, I had planned to get into this during gt2004's hackathon but got distracted by other issues. Lacking the experience of an in-depth debugging session I can't really do more than express my current 'suspicions'
(as stated in the thread above)
we've done quite a good job at solving the issue regarding encodings of request-parameters and even extended the servlet 2.3 new insights in doing so (setCharacterEncoding()) to support even 2.2 containers
however, one important part of the request object's set of getters is escaping this: the URL (and some of its derived 'paths' as well, I assume)
This explains why encoding in form-request params gets fixed correctly, but the url itself remains broken --> consequence:
- you can't link to non-latin-char-urls but you can pass non-latin-request-params
in more cocoon detail this means you can't expect cocoon matchers to get correctly triggered by non-latin-urls as well as you can't automount sitemaps in directories with non-latin-only-names...
(or read resources with non-latin-only-names as the original post of the other thread was about)
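To illustrate the suspicion, here is a small sketch of how the same escaped URL path decodes under the two encodings. It is not taken from Cocoon or from any container: the class name PathDecodeDemo and the path /docs/... are hypothetical, and java.net.URLDecoder is used only as a convenient stand-in for whatever a container does internally (note that it also turns '+' into a space, which a real path decoder would not do).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PathDecodeDemo {

    // Decodes %HH escapes and interprets the resulting bytes in the given charset.
    public static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical path whose file name is a-umlaut, o-umlaut, escaped as UTF-8.
        String escapedPath = "/docs/%C3%A4%C3%B6.html";
        String wanted = "/docs/\u00e4\u00f6.html";

        String asLatin1 = decode(escapedPath, "ISO-8859-1");
        String asUtf8 = decode(escapedPath, "UTF-8");

        // A pattern written with the real characters only matches the UTF-8 reading.
        System.out.println(asLatin1.equals(wanted)); // false
        System.out.println(asUtf8.equals(wanted));   // true
    }
}
```

If the path is read as ISO-8859-1 and never corrected, a matcher pattern or directory name containing the real characters will not compare equal to it, which would fit the symptoms listed above.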
Suggestion:
1. do some tests to verify above and list them as known limitations on appropriate wikis. --> tell about the two workarounds:
a/ to avoid non-latin urls (even if w3c says all urls should be utf-8 encoded)
b/ use jetty, set org.mortbay.util.URI.charset property and then DO change the cocoon 'container-encoding' param accordingly
2. (assuming my analysis is correct and gets confirmed by the tests) extend our http-wrapping-encoding-fix to include the urls and paths as well (using the tests as a way to verify the success of this)
3. start the crusade for the abolishment of all encodings but utf-8!
The time consuming part here is jamming together an easy deployable testsuite (zip with automount sitemap and all needed stuff inside) covering the various aspects... would be cool if somebody else could be doing that...
regards, -marc=
Joerg Heinicke wrote:
On 29.10.2004 08:44, Tuomo L wrote:
We're having some serious encoding problems. This happens only with the @href attributes in html, when using characters like ä, ö and å (in the Finnish alphabet). Form encoding works just fine. I've gone through all the threads concerning encoding (other people having encoding problems too). No luck so far. Is this still an issue in Cocoon? Could someone please tell what's wrong?
What's the page encoding? Forms work as expected? Just the links don't work? This normally points to a page encoding other than UTF-8, as link requests are encoded in UTF-8 while form requests are encoded in the page encoding. I don't think it is a Cocoon issue.
First a link about all the encodings: http://wiki.apache.org/cocoon/RequestParameterEncoding (mostly written by Bruno).
According to IE, the page encoding is set to UTF-8. The
container-encoding and form-encoding in web.xml (Tomcat) are set to UTF-8.
The container-encoding should not be touched at all and remain ISO-8859-1.
HTMLSerializer is set to use UTF-8 (mime-type="text/html; charset=utf-8")
and has the parameter <encoding>UTF-8</encoding>.
This should result in <meta http-equiv="Content-Type" content="text/html;charset=utf-8">. The request encoding header should have the same value ... which is not that easy when using a recent Tomcat: http://issues.apache.org/bugzilla/show_bug.cgi?id=26997
The xsl stylesheets use ISO-8859-1, though.
That's not a problem.
I've also tried setting everything to ISO-8859-1, but
the problem with the href-attributes in html remains. Mozilla Firefox
shows the characters correctly when doing "view source", but if I save the
document on disk and open with ASCII-editor, the encoding is wrong there
with both IE and Mozilla. So maybe it's not a browser problem?
Here's an example:
<a href="äö" foo="äö">äö</a>
becomes:
<a href="%C3%A4%C3%B6" foo="äö">äö</a>
when it should read (I think):
<a href="äö" foo="äö">äö</a>
... follow-up mail:
The URL-encoding is done wrong when serializing to HTML. According to specs "äö" should become "%E4%F6" when encoded, not "%C3%A4%C3%B6". This seems to be the problem. So far I've noticed this problem with the HREF-attribute only.
For a test I made a stylesheet that substitutes "ä" with "%E4" before serializing to HTML. This works, but it should be done by the serializer, right?
Seems like a Cocoon issue.
If it were an error at all, it would be a Xalan serializer problem, I think. But there were bugs reported on this topic and rejected because of the specs (I think they have the same problems as you):
http://nagoya.apache.org/jira/browse/XALANJ-1412
http://nagoya.apache.org/jira/browse/XALANJ-1548
As I wrote: you simply get different request encodings when sending a form or just clicking <a href=""/>.
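The two escaped forms from the example above can be reproduced with a few lines of Java. This is just a sketch to show where "%C3%A4%C3%B6" and "%E4%F6" come from: the class name HrefEscapeDemo is made up, and java.net.URLEncoder implements form-style escaping rather than Xalan's serializer logic, so it only stands in for the "%HH" step.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class HrefEscapeDemo {

    // Turns the string into bytes using the given charset, then %HH-escapes them.
    public static String escape(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u00e4\u00f6"; // a-umlaut, o-umlaut

        // What the HTML serializer emits in href, per the HTML 4.0 recommendation.
        System.out.println(escape(value, "UTF-8"));      // %C3%A4%C3%B6

        // What a form submitted from an ISO-8859-1 page would send.
        System.out.println(escape(value, "ISO-8859-1")); // %E4%F6
    }
}
```

Same characters, different bytes: which one the server receives depends on whether the request came from a link or from a form.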
Joerg
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
-- Marc Portier http://outerthought.org/ Outerthought - Open Source, Java & XML Competence Support Center Read my weblog at http://blogs.cocoondev.org/mpo/ [EMAIL PROTECTED] [EMAIL PROTECTED]
