Hi:Thanks for this Antonio. It all helps to make me a slightly less ignorant Englishman ;-)
Joerg's answer is OK.
I just added some additional info about the topic.
But my question is: "I have a file that contains entity references. I want to replace them with direct characters, e.g. in UTF-8. How do I do this?" That is, this question really has nothing to do with Cocoon specifically. I want to change the format of my source files.
Regards, Upayavira
Joerg Heinicke said:
On 17.04.2004 22:33, Upayavira wrote:
A few I18N questions:
1) I have some Polish text that uses character entities, such as
"si&#281;". How can I translate this into a single or double byte
character in either ISO-8859-1, ISO-8859-2 or UTF-8?
You can avoid the &#xxx; syntax by using UTF-8. UTF-8 allows you to write the character itself directly in the file. For example, in Spanish:
I really don't know the &#xxx; syntax for the following chars:
á, é, í, ó, ú, ñ, etc.
I just write them directly in the file (as above). It is easier for me, and it works because I use UTF-8 in the XML files. That is the gain.
They are not translated. While you have the entity representation in the
XML files, you have characters in Java. Only the serializer decides
whether it puts them out as character or character entity. In general
this can't be influenced, but the one or the other serializer might have
configuration options for this. But in the end (i.e. in the browser or
wherever) it should work for both the entity and the character as they
represent the same "thing".
Yep. I recommend using UTF-8 whenever possible: http://marc.theaimsgroup.com/?l=xml-cocoon-users&m=106142759328759&w=2
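To make the "parser gives you characters, serializer decides the output" point concrete, here is a minimal sketch in plain Java (JAXP), outside Cocoon. It assumes the JDK's built-in identity transformer; the class and method names (EntityToChar, reserialize) are made up for this illustration. The idea is to parse the XML and write it back out while asking the serializer for UTF-8, so that a reference like &#281; should come out as the character itself. Treat it as a starting point, not as the way Cocoon does it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: names are illustrative, not part of Cocoon.
final class EntityToChar {

    /** Parses the XML and writes it back out, asking the serializer for UTF-8. */
    static String reserialize(String xml) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not re-serialize the document", e);
        }
    }

    public static void main(String[] args) {
        // &#281; is the numeric reference for U+0119 (e with ogonek).
        String result = reserialize("<p>si&#281;</p>");
        // Check whether the reference was replaced by the character itself.
        System.out.println(result.contains("si\u0119"));
    }
}
```

For real files you would read from and write to streams instead of Strings, so that the bytes on disk are actually UTF-8.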
2) I can set the encoding of a page in the serialiser configuration. How
do I deal with the situation where the best encoding depends upon the
language, which means that the encoding should be chosen based upon the
encoding of a source file?
Again, try to use UTF-8. It is the best bet.
That's not possible. As written above you have more or less
encoding-neutral characters in Java (obviously not completely as
somewhere in the memory they are also just bytes).
Yep, Java uses Unicode (UTF-16) as the internal representation of Strings.
But at least they are
independent of the encoding of the original file. You do not know in
which encoding the XML file was.
Yes, the parser makes the conversion for you. The parser reads the @encoding in:
<?xml version="1.0" encoding="XXX"?>
where XXX is the encoding of the file.
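A small sketch of that behaviour, using the standard JAXP DOM parser rather than anything Cocoon-specific (the class name DeclaredEncoding and the method parseText are invented for the example). The same text is stored once as ISO-8859-1 bytes and once as UTF-8 bytes, each with a matching declaration, and in both cases the parser should hand back the same Java String.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class: names are illustrative only.
final class DeclaredEncoding {

    /** Parses raw bytes; the parser picks the decoder from the XML declaration. */
    static String parseText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        // \u00f1 is n with tilde: one byte in ISO-8859-1, two bytes in UTF-8.
        String latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>a\u00f1o</p>";
        String utf8 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><p>a\u00f1o</p>";

        String fromLatin1 = parseText(latin1.getBytes(StandardCharsets.ISO_8859_1));
        String fromUtf8 = parseText(utf8.getBytes(StandardCharsets.UTF_8));

        // Different bytes on disk, same characters in Java.
        System.out.println(fromLatin1.equals(fromUtf8));
    }
}
```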
***************************************************************
Note: According to the XML spec, if you omit the @encoding, the parser assumes UTF-8 by default (or UTF-16 if the file starts with a byte order mark). Example:
<?xml version="1.0"?>
***************************************************************
You also need to be aware that writing the @encoding in the XML header is not enough. It is just a declaration. You need to make sure that the file is really saved to disk in that encoding, whatever the Operating System default is. For this purpose I prefer jEdit - http://www.jEdit.org/ - which always tells me the encoding used to read/write the file.
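If you are not sure how a file was really saved, one way to check is to decode its bytes strictly as UTF-8 and see whether that fails. This is only a sketch using java.nio.charset from the JDK; the class name Utf8Check and the method isValidUtf8 are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative only.
final class Utf8Check {

    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently substituting characters.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "a\u00f1o"; // \u00f1 is n with tilde

        // Saved as UTF-8: the declaration encoding="UTF-8" would be truthful.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));

        // Saved as ISO-8859-1: declaring UTF-8 for these bytes would be wrong.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Keep in mind that pure ASCII content passes this check whatever the editor used, because ASCII bytes are the same in ISO-8859-1 and UTF-8. The check only tells you something once the file contains non-ASCII characters.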
Of course there are other editors that let you define the encoding. When I was beginning with Cocoon I really had nightmares, because the transition to UTF-8 coincided with my first steps, and what worked fine in RedHat 7.3 was not OK in RedHat 8. We changed the OS in the middle of development. The explanation was that RH 8 uses UTF-8 as the default while RH 7.3 does not. The world is moving to UTF-8 and we should try to use it everywhere.
-0-
I believe that keeping the whole processing pipeline in the same encoding avoids problems and is more efficient, since the system doesn't need to convert between encodings, which can end in undesired string representations.
For example: the XML is in ISO-8859-1 and you serialize in ISO-8859-1.
In fact we have 2 conversions there:
ISO-8859-1 -> UTF-16 while loading in Java
UTF-16 -> ISO-8859-1 while serializing from Java
Keep in mind that Java always uses Unicode internally. (Here I need to explain a little more: in memory, Java strings are UTF-16. That means 2 bytes for each char in memory.)
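The two conversions can be seen with nothing more than String and the JDK's StandardCharsets. This is just a sketch; the class name TwoConversions and the method roundTrip are invented for it. Bytes are decoded into a String (16-bit chars) on the way in and encoded back into bytes on the way out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: names are illustrative only.
final class TwoConversions {

    /** Decodes ISO-8859-1 bytes into a String and encodes them back again. */
    static byte[] roundTrip(byte[] latin1Bytes) {
        // Conversion 1: ISO-8859-1 bytes -> UTF-16 chars (what the parser does).
        String inMemory = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        // Conversion 2: UTF-16 chars -> ISO-8859-1 bytes (what the serializer does).
        return inMemory.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The bytes for "a", n with tilde, "o" in ISO-8859-1.
        byte[] original = {(byte) 0x61, (byte) 0xF1, (byte) 0x6F};
        String inMemory = new String(original, StandardCharsets.ISO_8859_1);

        // Three chars in Java, each one a 16-bit UTF-16 code unit.
        System.out.println(inMemory.length());
        // Written out as UTF-16 (big endian, no byte order mark) that is 6 bytes.
        System.out.println(inMemory.getBytes(StandardCharsets.UTF_16BE).length);
        // The round trip gives back the original bytes.
        System.out.println(Arrays.equals(original, roundTrip(original)));
    }
}
```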
You have to decide the serializer's
encoding only based on the possible character range. If it's spread
over several ISO char sets, better use UTF-8 in general. Another option would
be to use a selector based on user's locale which chooses the serializer
(with a specific encoding).
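Deciding "based on the possible character range" can be sketched with CharsetEncoder.canEncode from the JDK. The class EncodingPicker, the method pick and the preference order are assumptions made for the example, and nothing here is Cocoon API. The idea is to try the narrow ISO char sets first and fall back to UTF-8 when the text does not fit any of them.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: names and the preference order are illustrative only.
final class EncodingPicker {

    // Candidate encodings, tried in this order before falling back to UTF-8.
    private static final List<String> PREFERRED = Arrays.asList("ISO-8859-1", "ISO-8859-2");

    /** Returns the first preferred encoding that can represent every char of the text. */
    static String pick(String text) {
        for (String name : PREFERRED) {
            // ISO-8859-2 is not guaranteed on every JVM, so check before using it.
            if (Charset.isSupported(name) && Charset.forName(name).newEncoder().canEncode(text)) {
                return name;
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(pick("a\u00f1o"));            // Spanish: n with tilde
        System.out.println(pick("si\u0119"));            // Polish: e with ogonek
        System.out.println(pick("si\u0119 a\u00f1o"));   // both in one text
    }
}
```

A text that mixes Polish and Spanish characters fits neither ISO-8859-1 nor ISO-8859-2, which is the "spread over several char sets" case where UTF-8 is the practical answer.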
Another issue you need to keep in mind is that Cocoon is a servlet and the servlet container (Tomcat, Jetty) has the "last word". You need to "synchronize" it too. In particular, you will find 2 params in your web.xml:
<!-- Set encoding used by the container. If not set the ISO-8859-1 encoding will be assumed. -->
<init-param>
  <param-name>container-encoding</param-name>
  <param-value>utf-8</param-value>
</init-param>
<!-- Set form encoding. This will be the character set used to decode request parameters. If not set the ISO-8859-1 encoding will be assumed. -->
<init-param>
  <param-name>form-encoding</param-name>
  <param-value>utf-8</param-value>
</init-param>
You will run into problems if Cocoon serializes in ISO-8859-1 while your servlet container uses UTF-8. The same issue can show up even in httpd servers, when pages are saved on disk in ISO-8859-1 and your httpd server is set to use UTF-8.
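What such a mismatch looks like for request parameters can be reproduced with java.net.URLDecoder alone. This only mimics the decoding step and is not Cocoon's or the container's actual code; the class name FormDecoding and the method decode are invented for the example. A browser that submits a form from a UTF-8 page sends n with tilde as %C3%B1, and the result depends entirely on which encoding the receiving side assumes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: names are illustrative only.
final class FormDecoding {

    /** Decodes a percent-encoded parameter value using the given encoding name. */
    static String decode(String raw, String encoding) {
        try {
            return URLDecoder.decode(raw, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // "a", n with tilde, "o" as submitted from a UTF-8 page.
        String raw = "a%C3%B1o";

        // Matching encoding: three characters, as typed by the user.
        System.out.println(decode(raw, "UTF-8").length());

        // Assuming ISO-8859-1 instead: the two UTF-8 bytes become two separate
        // characters (U+00C3 and U+00B1), so the value is now four characters long.
        System.out.println(decode(raw, "ISO-8859-1").length());
    }
}
```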
Thanks for your help!
Me too.
Those i18n ignorant English men! ;-)
lol. Not English man, but still, I am! :-DD
Best Regards,
Antonio Gallardo
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
