Re: [i18n] Replacing entities with encoded chars

Antonio Gallardo Sat, 17 Apr 2004 23:44:37 -0700

Hi:

Joerg's answer is right.
I have just added some extra info on the topic.

Joerg Heinicke said:
> On 17.04.2004 22:33, Upayavira wrote:
>
>> A few I18N questions:
>>
>> 1) I have some polish text that uses character entities, such as
>> "si&#281;" How can I translate this into a single or double byte
>> character in either ISO-8859-1, ISO-8859-2 or UTF-8?

You can avoid the &#xxx; form by using UTF-8. UTF-8 allows you to write
the character itself directly in the file. For example, in Spanish:

I really don't know the &#xxx; syntax for the following chars:

á, é, í, ó, ú, ñ, etc.

I just write them directly in the file (as above). It is easier for me,
and it works because I use UTF-8 in the XML files. This is the gain.

> They are not translated. While you have the entity representation in the
> XML files, you have characters in Java. Only the serializer decides
> whether it puts them out as character or character entity. In general
> this can't be influenced, but the one or the other serializer might have
> configuration options for this. But at the end (i.e. in the browser or
> where ever) it should work for both the entity and the character as they
> represent the same "thing".

Yep. I recommend using UTF-8 whenever possible:
http://marc.theaimsgroup.com/?l=xml-cocoon-users&m=106142759328759&w=2
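
To make Joerg's point concrete, here is a minimal Java sketch. It assumes
the JDK's built-in XML parser; the class name EntityDemo and the element
name <w> are made up for illustration. It parses the same word once
written with the character reference and once with the literal character,
and checks that both arrive in Java as the same String:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class EntityDemo {
    // Parse a small XML document and return the text of its root element.
    static String textOf(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        // Same word, once with the character reference, once with the char.
        String fromEntity = textOf(head + "<w>si&#281;</w>");
        String fromChar = textOf(head + "<w>si\u0119</w>");
        System.out.println(fromEntity.equals(fromChar)); // true
        System.out.println((int) fromEntity.charAt(2));  // 281
    }
}
```

After parsing there is no trace of which notation the file used; only the
serializer decides what goes out again.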

>> 2) I can set the encoding of a page in the serialiser configuration. How
>> do I deal with the situation where the best encoding depends upon the
>> language, which means that the encoding should be chosen based upon the
>> encoding of a source file?

Again, try to use UTF-8. It is the best bet.

> That's not possible. As written above you have more or less
> encoding-neutral characters in Java (obviously not completely as
> somewhere in the memory they are also just bytes).

Yep, Java uses Unicode (UTF-16) as the internal representation of Strings.

> But at least they are
> independent on the encoding of the original file. You do not know in
> which encoding the XML file was.

Yes, the parser makes the conversion for you. The parser reads the
@encoding in:

<?xml version="1.0" encoding="XXX"?>

where XXX is the encoding of the file.

***************************************************************
Note: According to the XML spec, if you omit the @encoding, the default
encoding is UTF-8 (or UTF-16 if the file starts with a byte order mark).
Example:

<?xml version="1.0"?>
***************************************************************

You also need to be aware that writing the @encoding in the XML header is
not enough. It is just a declaration. You need to make sure that the file
is really saved to disk in that encoding, whatever the default of the
operating system is. For this purpose I prefer to use jEdit -
http://www.jEdit.org/ which always tells me the encoding used to
read/write the file.
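
The following Java sketch illustrates the three cases above, assuming the
JDK's built-in parser. The class and method names are invented for the
example, and the "file" is just a byte array. The last case shows what
happens when the declaration and the bytes on disk disagree:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;

public class XmlEncodingDemo {
    // Build a one-word document, write it to bytes using `savedAs`, and
    // parse it. `declared` is the value of the header's encoding
    // attribute; pass null to leave the attribute out.
    static String parse(String declared, String savedAs) {
        String header = declared == null
                ? "<?xml version=\"1.0\"?>"
                : "<?xml version=\"1.0\" encoding=\"" + declared + "\"?>";
        byte[] onDisk = (header + "<w>si\u0119</w>")
                .getBytes(Charset.forName(savedAs));
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(onDisk))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Declaration and bytes agree: the parser returns U+0119.
        System.out.println(parse("ISO-8859-2", "ISO-8859-2").equals("si\u0119"));
        // No declaration, bytes in UTF-8: the default applies.
        System.out.println(parse(null, "UTF-8").equals("si\u0119"));
        // Declared ISO-8859-1 but saved as ISO-8859-2: the byte 0xEA is
        // silently read as U+00EA (e with circumflex), not U+0119.
        System.out.println(parse("ISO-8859-1", "ISO-8859-2").equals("si\u00EA"));
    }
}
```

All three lines print true. The third one is the nasty case: no error is
raised, the text is simply wrong.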

Of course there are other editors that allow you to define the encoding.
While beginning with Cocoon, I really had nightmares, because the
transition to UTF-8 coincided with my first steps, and what worked fine in
RedHat 7.3 was not OK in RedHat 8. And we changed the OS in the middle of
development. The answer was that RH 8 uses UTF-8 as default while RH 7.3
does not. The world is moving to UTF-8 and we need to try to use it
everywhere.

                                 -0-

I believe that keeping the whole processing pipeline in the same encoding
avoids problems and is more efficient, since the system doesn't need to
make conversions between encodings that can end in undesired string
representations.

For example:
XML in ISO-8859-1
Serialize in ISO-8859-1

In fact we have 2 conversions there:

ISO-8859-1 -> UTF-16 while loading into Java
UTF-16 -> ISO-8859-1 while serializing from Java
Keep in mind that Java always uses Unicode internally, whatever the
encoding of the file. (To be precise, Java strings in memory are UTF-16,
not UTF-8. That means 2 bytes for each char in memory.)
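
The two conversions can be sketched in plain Java as below. The helper
names load and serialize are invented here and only stand in for what a
parser and a serializer do with the bytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ConversionDemo {
    // Bytes in some encoding -> Java String (UTF-16 in memory).
    static String load(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // Java String -> bytes in the output encoding.
    static byte[] serialize(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // The three letters a, n-tilde, o as ISO-8859-1 bytes.
        byte[] onDisk = {(byte) 0x61, (byte) 0xF1, (byte) 0x6F};
        String inMemory = load(onDisk, latin1);
        // Same encoding in and out: the bytes survive the round trip.
        System.out.println(Arrays.equals(onDisk, serialize(inMemory, latin1)));
        // 3 chars in memory; n-tilde needs 2 bytes in UTF-8, so 4 bytes.
        System.out.println(inMemory.length());
        System.out.println(serialize(inMemory, StandardCharsets.UTF_8).length);
        // U+0119 does not exist in ISO-8859-1 and comes out as '?'.
        byte[] lossy = serialize("si\u0119", latin1);
        System.out.println((char) lossy[2]);
    }
}
```

The last line is the kind of undesired representation meant above: the
serializer's encoding cannot express the character, so it is replaced.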

> You have to decide the serializer's
> encoding only based on the possible character range. If it's strewed
> over the ISO char sets better use UTF-8 in general. Another option would
> be to use a selector based on user's locale which chooses the serializer
> (with a specific encoding).
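
Joerg's rule of thumb (pick the encoding by character range, otherwise
UTF-8) could be checked in Java roughly like this. It is only a sketch:
the class name, the candidate list and the choose method are assumptions
for illustration, not anything Cocoon provides:

```java
import java.nio.charset.Charset;

public class EncodingChooser {
    // Hypothetical list of single-byte encodings we would accept.
    static final String[] CANDIDATES = {"ISO-8859-1", "ISO-8859-2"};

    // Return the first candidate that can represent every character of
    // the text, or UTF-8 if none of them can.
    static String choose(String text) {
        for (String name : CANDIDATES) {
            if (Charset.forName(name).newEncoder().canEncode(text)) {
                return name;
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(choose("se\u00F1or"));              // ISO-8859-1
        System.out.println(choose("si\u0119"));                // ISO-8859-2
        System.out.println(choose("si\u0119, se\u00F1or"));    // UTF-8
    }
}
```

As soon as the text mixes characters from different ISO sets, only UTF-8
is left, which is why it is the simpler default.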

Another issue you need to keep in mind is that Cocoon is a servlet and the
servlet container (Tomcat, Jetty) has the "last word". You need to
"synchronize" it too. In particular, you will find 2 params in your
web.xml:

<!--
  Set encoding used by the container. If not set the ISO-8859-1 encoding
  will be assumed.
-->

<init-param>
  <param-name>container-encoding</param-name>
  <param-value>utf-8</param-value>
</init-param>

<!--
  Set form encoding. This will be the character set used to decode request
  parameters. If not set the ISO-8859-1 encoding will be assumed.
-->

<init-param>
  <param-name>form-encoding</param-name>
  <param-value>utf-8</param-value>
</init-param>

You will run into problems if Cocoon serializes ISO-8859-1 and your
servlet container uses UTF-8. The same issue can show up even with httpd
servers, when pages are saved on disk using ISO-8859-1 and your httpd
server is set to use UTF-8.
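
To see why form-encoding has to match what the browser sent, here is a
small Java sketch (it assumes Java 10 or later for the Charset overloads
of URLEncoder and URLDecoder; the class and method names are invented).
It encodes a request parameter the way a UTF-8 page would and decodes it
with both charsets:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FormEncodingDemo {
    // What a browser sends for a form field on a UTF-8 page.
    static String submit(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // What the server sees when it decodes with the given charset.
    static String receive(String raw, Charset formEncoding) {
        return URLDecoder.decode(raw, formEncoding);
    }

    public static void main(String[] args) {
        String raw = submit("si\u0119");
        System.out.println(raw); // si%C4%99
        // Decoded as UTF-8: the original word comes back.
        System.out.println(receive(raw, StandardCharsets.UTF_8).equals("si\u0119"));
        // Decoded as ISO-8859-1: each of the two bytes becomes its own
        // char (U+00C4, U+0099), so the word is now 4 chars long.
        System.out.println(receive(raw, StandardCharsets.ISO_8859_1).length());
    }
}
```

The second decode is what you get when the page is UTF-8 but the
form-encoding is left at its ISO-8859-1 default.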

>> Thanks for your help!

Me too.

> Those i18n ignorant English men! ;-)

lol. Not an Englishman, but still, I am! :-DD

Best Regards,

Antonio Gallardo

