rehashing what I read from Tuomo's setup:
- cocoon-servlet init params are set to leave container-encoding unchanged (thus iso-8859-1), like we recommend, and form-encoding to utf-8 to make sure his forms can support a wide variety of characters
- as a consequence of this last setting (and the well-known browser limitation that forms get submitted in the page encoding) we need to sync the encoding on the serializer to this same utf-8
- because of this setting there is no reason to complain about the resulting HTML being full of utf-8 encoded content, and no need to refer to specs or blame cocoon: xml serialization was requested to use utf-8, so it does (even xalan does its work here, I suppose)
now, what goes wrong?
well, I had planned to get into this during the gt2004 hackathon but got distracted by other issues. Lacking the experience of an in-depth debugging session, I can't really do more than express my current 'suspicions'
(as stated in the thread above)
we've done quite a good job at solving the issue regarding encodings of request-parameters, and even extended the new servlet 2.3 approach (setCharacterEncoding()) so that it works on 2.2 containers as well
however, one important part of the request object's set of getters escapes this fix: the URL (and some of its derived 'paths' as well, I assume)
This explains why the encoding of form request params gets fixed correctly, but the url itself remains broken --> consequence:
- you can't link to non-latin-char-urls but you can pass non-latin-request-params
in more cocoon detail this means you can't expect cocoon matchers to be triggered correctly by non-latin urls, nor can you automount sitemaps in directories with non-latin-only names...
(or read resources with non-latin-only-names as the original post of the other thread was about)
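To make the suspicion above concrete, here is a minimal Java sketch. It is not Cocoon's actual code: the class name, method names and the sample path are made up for illustration, and it assumes the container decodes the URL bytes as ISO-8859-1 while the browser sent UTF-8. It shows how a path gets mangled and how the same kind of re-decoding we apply to request parameters could repair it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;

public class PathRedecodeDemo {

    // Simulates a container that decodes percent-encoded URL bytes with its
    // own fixed encoding. Caveat: URLDecoder also turns '+' into a space,
    // which is form-style decoding; the sample path below contains no '+'.
    static String containerDecode(String rawPath, String containerEncoding) {
        try {
            return URLDecoder.decode(rawPath, containerEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + containerEncoding, e);
        }
    }

    // Recovers the original bytes via the container encoding, then decodes
    // them with the encoding the client really used.
    static String redecode(String value, String containerEncoding, String intendedEncoding) {
        byte[] originalBytes = value.getBytes(Charset.forName(containerEncoding));
        return new String(originalBytes, Charset.forName(intendedEncoding));
    }

    public static void main(String[] args) {
        // Hypothetical request path: a-umlaut and o-umlaut sent as UTF-8.
        String rawPath = "/docs/%C3%A4%C3%B6.html";

        String seenByMatcher = containerDecode(rawPath, "ISO-8859-1");
        String repaired = redecode(seenByMatcher, "ISO-8859-1", "UTF-8");

        // The matcher sees four garbage characters instead of two letters.
        System.out.println(seenByMatcher.equals("/docs/\u00c3\u00a4\u00c3\u00b6.html")); // true
        // After re-decoding, the intended path is back.
        System.out.println(repaired.equals("/docs/\u00e4\u00f6.html")); // true
    }
}
```

The round trip through ISO-8859-1 is lossless because that encoding maps every byte value to exactly one character; with another container encoding the original bytes may not be recoverable.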
Suggestion:
1. do some tests to verify above and list them as known limitations on appropriate wikis. --> tell about the two workarounds:
a/ to avoid non-latin urls (even if w3c says all urls should be utf-8 encoded)
b/ use jetty, set org.mortbay.util.URI.charset property and then DO change the cocoon 'container-encoding' param accordingly
2. (assuming my analysis is correct and gets confirmed by the tests) extend our http-wrapping-encoding-fix to include the urls and paths as well (using the tests as a way to verify the success of this)
3. start the crusade for the abolishment of all encodings but utf-8!
The time-consuming part here is putting together an easily deployable test suite (a zip with an automount sitemap and all needed stuff inside) covering the various aspects... would be cool if somebody else could do that...
regards, -marc=
Joerg Heinicke wrote:
On 29.10.2004 08:44, Tuomo L wrote:
We're having some serious encoding problems. This happens only with the @href attributes in html, when using characters like ä, ö and å (in the Finnish alphabet). Form encoding works just fine. I've gone through all the threads concerning encoding (other people having encoding problems too). No luck so far. Is this still an issue in Cocoon? Could someone please tell what's wrong?
What's the page encoding? Forms work like expected? Just the links don't work? This normally points to a different page encoding than UTF-8 as link requests are encoded in UTF-8 while form requests are encoded in page encoding. I don't think it is a Cocoon issue.
First a link about all the encodings: http://wiki.apache.org/cocoon/RequestParameterEncoding (mostly written by Bruno).
According to IE, the page encoding is set to UTF-8. The
container-encoding and form-encoding in web.xml (Tomcat) are set to UTF-8.
The container-encoding should not be touched at all and remain ISO-8859-1.
HTMLSerializer is set to use UTF-8 (mime-type="text/html; charset=utf-8") and has the parameter <encoding>UTF-8</encoding>.
This should result in <meta http-equiv="Content-Type" content="text/html;charset=utf-8">. The request encoding header should have the same value ... which is not that easy when using a recent Tomcat: http://issues.apache.org/bugzilla/show_bug.cgi?id=26997
The xsl stylesheets use ISO-8859-1, though.
That's not a problem.
I've also tried setting everything to ISO-8859-1, but
the problem with the href-attributes in html remains. Mozilla Firefox
shows the characters correctly when doing "view source", but if I save the
document on disk and open with ASCII-editor, the encoding is wrong there
with both IE and Mozilla. So maybe it's not a browser problem?
Here's an example:
<a href="äö" foo="äö">äö</a>
becomes:
<a href="%C3%A4%C3%B6" foo="äö">äö</a>
when it should read (I think):
<a href="äö" foo="äö">äö</a>
... follow-up mail:
The URL-encoding is done wrong when serializing to HTML. According to specs "äö" should become "%E4%F6" when encoded, not "%C3%A4%C3%B6". This seems to be the problem. So far I've noticed this problem with the HREF-attribute only.
For a test I made a stylesheet that substitutes "ä" with "%E4" before serializing to HTML. This works, but it should be done by the serializer, right?
Seems like a Cocoon issue.
If it were an error at all, it would be a Xalan serializer problem, I think. But bugs were reported on this topic and rejected because of the specs (I think the reporters had the same problem as you):
http://nagoya.apache.org/jira/browse/XALANJ-1412 http://nagoya.apache.org/jira/browse/XALANJ-1548
As I wrote: you simply get different request encodings when sending a form or just clicking <a href=""/>.
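For illustration, a small Java sketch of the two percent-encodings discussed in this thread. The class and method names are hypothetical; only java.net.URLEncoder from the standard library is used. It shows that the same two characters produce different escape sequences depending on which charset supplies the bytes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class HrefEncodingDemo {

    // Percent-encodes text using the byte representation of the given charset.
    static String percentEncode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // a-umlaut followed by o-umlaut, written as escapes so the source
        // file's own encoding cannot interfere.
        String text = "\u00e4\u00f6";

        // UTF-8 needs two bytes for each of these characters.
        System.out.println(percentEncode(text, "UTF-8"));      // %C3%A4%C3%B6
        // ISO-8859-1 needs one byte for each of them.
        System.out.println(percentEncode(text, "ISO-8859-1")); // %E4%F6
    }
}
```

Which of the two a server accepts depends on the charset it uses to decode the URL bytes, so both sides have to agree on it.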
Joerg
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
-- Marc Portier http://outerthought.org/ Outerthought - Open Source, Java & XML Competence Support Center Read my weblog at http://blogs.cocoondev.org/mpo/
