----- Original Message -----
From: Mike Pogue <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, January 28, 2000 9:23 PM
Subject: Re: xml encodings, java

Hello!

And thank you for your excellent answer!

> I'm not sure how Cocoon works, but let me summarize how encodings work
> in general.
>
> The XML spec says (I'm paraphrasing) that if you use an encoding other
> than UTF-8 or UTF-16, then you must specify the encoding in the initial
> line, like this:
>
> <?xml version="1.0" encoding="koi8"?>
>
> The parser uses the first 4 bytes of the file to determine what the
> encoding of the first line is. You cannot put any characters before the
> "<?xm". A byte order mark is required, in the case of UTF-16, which
> specifies BE or LE.
>
> The first 4 bytes are used to guess at the encoding of the first line.
> The first line is read, and if an encoding clause is present, the
> encoding is switched to that one for the rest of the file. In the Java
> parser, the underlying JDK is called to instantiate the encoder.
> In most cases, the NAME of the encoding is NOT the Java name -- it's the
> MIME or IANA name. There is a switch on the parser to permit Java
> encoding names, too, but using this switch can result in non-standard
> XML (XML that cannot be read on other parsers). I do not recommend
> doing this.
>
> Now, if the first line is NOT present, then the parser assumes that the
> file is UTF-8.

As I understand it, the parser re-encodes the XML from the input encoding
to UTF-8. Does this fail for me because my JVM does not have the koi8
encoding installed?

[skip]

> > Why not to work with xml content like with raw data, only processing tags?
>
> The XML spec says that arbitrary binary data is NOT allowed in an XML
> document. It must be data in a recognized encoding, and every character
> is checked to make sure that it is in a legal range.
>

Hmm. I understand that Xerces is more than just a parser for Cocoon, but,
IMHO, there is a problem in how Xerces and Cocoon interact. I always use
valid documents with Cocoon, and I don't think publishing is the time for
validation...

> On output, you need to make sure that your files follow all these
> rules. An easy way to check this, is to use the parser itself. Feed
> your input or output file through one of the parser sample programs, and
> let the parser tell you which characters are wrong. It will tell you
> which characters are illegal, and what line they're on.
>
> This method makes it a LOT easier to track down encoding-related
> problems.
>

Sure. But, IMHO, this is not the right way for a publishing engine.

> Another way to eliminate a lot of problems is to use UTF-8 as an output
> encoding. This is sometimes not possible, but UTF-8 does contain all
> the Cyrillic characters of koi8 (as far as I know). And, your resulting
> XML will be portable to more environments, because ALL XML parsers are
> required to understand UTF-8.

I use XML only for publishing. I use Cocoon with Russian Apache (Apache
with patches by Alex Tutubalin), which is very popular in Russia. The main
feature of RA is re-encoding documents into the client (browser) encoding.
We have at least 4 encodings for Cyrillic characters (5 with Unicode), and
I decided to use only koi8 on my Linux servers, because Unicode support is
too weak now.

So I need Cocoon to read documents in koi8 (or any encoding) and to output
koi8 (or any other encoding, but how do I tell Cocoon this?).
How can I do my work without patching each new version of Xerces?

Dmitry Melekhov
http://www.aspec.ru/~dm
2:5050/[EMAIL PROTECTED]

P.S. It looks like this list is not the place for such questions, but
there have been no replies on the Cocoon users list :(
Please point me to the right list.
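
On the question above of whether the JVM has a given encoding installed,
and on the suggestion to convert to UTF-8: both can be checked outside the
parser with a few lines of Java. What follows is only a minimal sketch,
not anything taken from Xerces or Cocoon. The class name EncodingCheck and
its methods are hypothetical, and it assumes a JDK that provides
java.nio.charset (1.7 or later for StandardCharsets). Note that the name
registered with IANA for this encoding is "KOI8-R"; whether the short name
"koi8" is accepted depends on the JVM's alias table.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class EncodingCheck {

    // Returns true if this JVM can handle the named charset.
    // Syntactically illegal names are reported as unsupported.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Decodes the bytes using the source charset and re-encodes as UTF-8.
    // Throws an unchecked exception if the source charset is not available.
    static byte[] toUtf8(byte[] input, String sourceCharset) {
        String text = new String(input, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Report what this particular JVM knows about.
        for (String name : new String[] {"KOI8-R", "koi8", "UTF-8"}) {
            System.out.println(name + " supported: " + isSupported(name));
        }
    }
}
```

If "KOI8-R" is reported as supported but "koi8" is not, changing the name
in the encoding declaration of the documents may be enough. If neither is
supported, converting the files to UTF-8 on a machine whose JVM does have
the charset is one possible workaround.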
