----- Original Message -----
From: Mike Pogue <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, January 28, 2000 9:23 PM
Subject: Re: xml encodings, java

Hello!

And thank you for your perfect answer!


> I'm not sure how Cocoon works, but let me summarize how encodings work
> in general.
>
> The XML spec says (I'm paraphrasing) that if you use an encoding other
> than UTF-8 or UTF-16, then you must specify the encoding in the initial
> line, like this:
>
> <?xml version="1.0" encoding="koi8"?>
>
> The parser uses the first 4 bytes of the file to determine what the
> encoding of the first line is.  You cannot put any characters before the
> "<?xm".  A byte order mark is required, in the case of UTF-16, which
> specifies BE or LE.
>
> The first 4 bytes are used to guess at the encoding of the first line.
> The first line is read, and if an encoding clause is present, the
> encoding is switched to that one for the rest of the file.  In the Java
> parser, the underlying JDK is called to instantiate the encoder.
> In most cases, the NAME of the encoding is NOT the Java name -- it's the
> MIME or IANA name.  There is a switch on the parser to permit Java
> encoding names, too, but using this switch can result in non-standard
> XML (XML that cannot be read on other parsers).  I do not recommend
> doing this.
>
> Now, if the first line is NOT present, then the parser assumes that the
> file is UTF-8.
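
The detection step described above can be sketched roughly as follows. This is only an illustration based on the byte patterns listed in the XML 1.0 specification's autodetection appendix, not the code Xerces actually uses; the class name `EncodingGuess` and the returned labels are made up for this example.

```java
// Hypothetical helper, not part of Xerces or Cocoon.
class EncodingGuess {

    /** Guesses the encoding family of an XML document from its first bytes. */
    static String guess(byte[] b) {
        // 4-byte marks must be tested before the 2-byte marks they begin with.
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UCS-4";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UCS-4";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";       // byte order mark, big-endian
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";       // byte order mark, little-endian
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";    // UTF-8 byte order mark
        if (startsWith(b, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE"; // "<?" without a mark
        if (startsWith(b, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE"; // "<?" without a mark
        if (startsWith(b, 0x4C, 0x6F, 0xA7, 0x94)) return "EBCDIC";   // "<?xm" in EBCDIC
        // "<?xm" in an ASCII-compatible encoding: read the declaration as ASCII,
        // then switch to whatever its encoding clause names.
        if (startsWith(b, 0x3C, 0x3F, 0x78, 0x6D)) return "ASCII-compatible";
        // No declaration and no mark: the document is assumed to be UTF-8.
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A document that starts with `<?xml ... encoding="koi8"?>` falls into the "ASCII-compatible" case, so the declared name is only looked up after the first line has been read.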

As I understand it, the parser re-encodes the XML from the input encoding
to UTF-8. So this does not work for me because my JVM doesn't have the
koi8 encoding installed?
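
One way to check this is to ask the JVM directly whether it knows a given encoding name. The sketch below assumes a JVM that provides `java.nio.charset.Charset`, which is newer than the JDK discussed in this thread; the class name `CharsetCheck` is made up for this example. The IANA name for the Russian KOI8 encoding is "KOI8-R", and whether it is available depends on the JVM installation.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Xerces or Cocoon.
class CharsetCheck {

    /** Returns true if this JVM can instantiate a charset with the given name. */
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or for syntactically illegal charset names.
            return false;
        }
    }
}
```

Calling `CharsetCheck.isAvailable("KOI8-R")` and then `CharsetCheck.isAvailable("koi8")` shows whether the JVM has the encoding at all, and whether it also accepts the shorter name used in the XML declaration.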

[skip]
>
> > Why not work with the XML content as raw data, processing only the
> > tags?
>
> The XML spec says that arbitrary binary data is NOT allowed in an XML
> document.  It must be data in a recognized encoding, and every character
> is checked to make sure that it is in a legal range.
>

Hmm. I understand that Xerces is more than just a parser for Cocoon,
but, IMHO, there is a problem in the interaction between Xerces and Cocoon.
I always use valid documents with Cocoon, and I don't think
that publishing is the time for validation...

> On output, you need to make sure that your files follow all these
> rules.  An easy way to check this, is to use the parser itself.  Feed
> your input or output file through one of the parser sample programs, and
> let the parser tell you which characters are wrong.  It will tell you
> which characters are illegal, and what line they're on.
>
> This method makes it a LOT easier to track down encoding-related
> problems.
>

Sure. But, IMHO, this is not the right approach for a publishing engine.

> Another way to eliminate a lot of problems is to use UTF-8 as an output
> encoding.  This is sometimes not possible, but UTF-8 does contain all
> the Cyrillic characters of koi8 (as far as I know).  And, your resulting
> XML will be portable to more environments, because ALL XML parsers are
> required to understand UTF-8.

I use XML only for publishing. And I use Cocoon with Russian Apache (RA),
which is Apache with patches by Alex Tutubalin and is very popular in Russia.
The main feature of RA is re-encoding documents into the client (browser)
encoding. We have at least 4 encodings for Cyrillic characters (5 with
Unicode), and I decided to use only koi8 on my Linux servers, because Unicode
support is too weak now.
So I need Cocoon to read documents in koi8 (or any encoding)
and output koi8 (or any other, but how do I tell Cocoon this?).
How can I do my work without patching each new version of Xerces?
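
For illustration only, this is roughly what "read one encoding, write another" looks like at the plain Java level, outside of Cocoon. It is a sketch and not a Cocoon or Xerces setting: the class name `Transcode` is made up, and it assumes a JVM with `java.nio.charset.Charset`. If the result is written out as XML, the encoding clause in the first line has to be changed to match the output encoding as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper, not part of Xerces or Cocoon.
class Transcode {

    /** Decodes bytes from one charset and re-encodes them in another. */
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        Reader reader = new InputStreamReader(in, from);
        Writer writer = new OutputStreamWriter(out, to);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // push any buffered characters through the encoder
    }

    /** Convenience wrapper for in-memory data. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(input), from, out, to);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With a JVM that supports it, `Transcode.transcode(bytes, Charset.forName("KOI8-R"), Charset.forName("UTF-8"))` would convert a koi8 document to UTF-8, and swapping the two charsets converts it back.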

Dmitry Melekhov
http://www.aspec.ru/~dm
2:5050/[EMAIL PROTECTED]

P.S.
It looks like this list is not for such questions, but there are no replies
on the Cocoon users list :( Please point me to the right list.

