Slight correction... The BOM is required for UTF-16 only if the XMLDecl
line (<?xml...) is not present. If the XMLDecl is present then we can
figure it out from that (though a BOM can also still be present.)
----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]
Mike Pogue <[EMAIL PROTECTED]> on 01/28/2000 09:23:18 AM
Please respond to [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
cc:
Subject: Re: xml encodings, java
I'm not sure how Cocoon works, but let me summarize how encodings work
in general.
The XML spec says (I'm paraphrasing) that if you use an encoding other
than UTF-8 or UTF-16, then you must specify the encoding in the initial
line, like this:
<?xml version="1.0" encoding="koi8"?>
The parser uses the first 4 bytes of the file to guess the encoding of
the first line, so you cannot put any characters before the "<?xm".
In the case of UTF-16, a byte order mark is required, which specifies
BE or LE.
The first line is then read, and if an encoding clause is present, the
parser switches to that encoding for the rest of the file. In the Java
parser, the underlying JDK is called to instantiate the converter for
that encoding.
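A minimal sketch of that sniffing step, for illustration only. The class and method names (EncodingSniffer, guess) are hypothetical and this is not the Xerces code; the byte patterns follow the autodetection appendix of the XML 1.0 spec, with the UCS-4 cases left out for brevity.

```java
/** Hypothetical helper: guesses the encoding family from the first 4 bytes. */
final class EncodingSniffer {

    /** Returns a coarse guess; an encoding="..." clause may refine it later. */
    static String guess(byte[] b) {
        if (b.length < 4) {
            return "UTF-8"; // too short to sniff, fall back to the default
        }
        int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF, b2 = b[2] & 0xFF, b3 = b[3] & 0xFF;

        // Byte order marks.
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";

        // No BOM: look at how "<?" or "<?xm" is laid out in memory.
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        if (b0 == 0x4C && b1 == 0x6F && b2 == 0xA7 && b3 == 0x94) return "EBCDIC";
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) {
            // "<?xm" in UTF-8, ISO 8859-x, KOI8-R, ...: the declaration can be
            // read as ASCII, and its encoding clause names the real encoding.
            return "ASCII-compatible";
        }
        return "UTF-8"; // no declaration and no BOM: the default
    }

    public static void main(String[] args) {
        byte[] start = {0x3C, 0x3F, 0x78, 0x6D}; // "<?xm"
        System.out.println(guess(start));
    }
}
```

The guess only has to be good enough to read the first line; the encoding clause, if present, decides what is used for the rest of the file.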
In most cases, the NAME of the encoding is NOT the Java name -- it's the
MIME or IANA name. There is a switch on the parser to permit Java
encoding names, too, but using this switch can result in non-standard
XML (XML that cannot be read on other parsers). I do not recommend
doing this.
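To illustrate the difference between the two kinds of names, here is a small sketch that maps historical Java names to the canonical ones. It assumes a modern JDK with java.nio.charset (which is newer than the 1.1.x JVMs discussed in this thread); the class name EncodingNames is hypothetical.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: resolves an encoding name or alias to its canonical name. */
final class EncodingNames {

    /** Accepts any name the JVM recognizes and returns the canonical name. */
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        // "UTF8" and "8859_1" are historical Java names; "KOI8-R" is an IANA
        // name that a given JVM may or may not support, so check first.
        for (String name : new String[] {"UTF8", "8859_1", "KOI8-R"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + " -> " + canonicalName(name));
            }
        }
    }
}
```

Writing the canonical name into the encoding clause keeps the document readable by parsers that know nothing about Java's own names.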
Now, if the first line is NOT present, then the parser assumes that the
file is UTF-8 (or UTF-16, if it starts with a byte order mark).
There are ways to "override" this behavior, and basically "lie" to the
parser about the encoding, but unless you know what you're doing, I'd
recommend against doing so -- it's dangerous.
On top of that, not all characters are allowed in XML, bringing us to
your next question:
> Why not to work with xml content like with raw data, only processing
> tags?
The XML spec says that arbitrary binary data is NOT allowed in an XML
document. It must be data in a recognized encoding, and every character
is checked to make sure that it is in a legal range.
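As a rough illustration of that range check, the sketch below tests code points against the Char production of the XML 1.0 spec. The class and method names (XmlChars, isXmlChar, firstIllegal) are hypothetical, and this is not how Xerces implements it.

```java
/** Hypothetical helper: checks characters against the XML 1.0 Char production. */
final class XmlChars {

    /** True if the code point may appear in an XML 1.0 document. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first illegal character, or -1 if there is none. */
    static int firstIllegal(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // an unpaired surrogate comes back as itself
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegal("plain text"));
        System.out.println(firstIllegal("bad\u0001char"));
    }
}
```

Note that Cyrillic letters are perfectly legal characters; it is mostly control characters and stray surrogates that get rejected.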
On output, you need to make sure that your files follow all these
rules. An easy way to check this is to use the parser itself. Feed
your input or output file through one of the parser sample programs, and
the parser will tell you which characters are illegal and what line
they're on.
This method makes it a LOT easier to track down encoding-related
problems.
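If the sample programs are not at hand, the same check can be sketched in a few lines. This assumes the JAXP API (javax.xml.parsers) that ships with current JDKs, which is not necessarily what the sample programs use; the class name WellFormedCheck is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: lets the parser report the first problem in a document. */
final class WellFormedCheck {

    /** Returns null if the bytes parse cleanly, otherwise a short error report. */
    static String check(byte[] xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Decoding failures may surface as I/O errors instead.
            return "error: " + e;
        }
    }

    public static void main(String[] args) {
        byte[] bad = "<a>\n\u0001</a>".getBytes(StandardCharsets.UTF_8);
        System.out.println(check(bad));
    }
}
```

Running this over both the input file and the generated output narrows down which side of the pipeline introduces the bad characters.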
Another way to eliminate a lot of problems is to use UTF-8 as an output
encoding. This is sometimes not possible, but UTF-8 does contain all
the Cyrillic characters of koi8 (as far as I know). And, your resulting
XML will be portable to more environments, because ALL XML parsers are
required to understand UTF-8.
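A sketch of converting bytes from some source encoding to UTF-8, assuming a modern JDK; the class name ToUtf8 is hypothetical. For koi8 input you would pass Charset.forName("KOI8-R"), provided your JVM supports it. Remember that the encoding clause in the first line must be changed (or dropped) to match the new encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: re-encodes text from a source charset to UTF-8. */
final class ToUtf8 {

    /** Decodes the input with the given charset and re-encodes it as UTF-8. */
    static byte[] transcode(byte[] input, Charset from) {
        return new String(input, from).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 is "e with acute accent" in ISO-8859-1; UTF-8 needs two bytes for it.
        byte[] latin1 = {(byte) 0xE9};
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length + " byte(s) after conversion");
    }
}
```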
Hope this helps!
Mike
Dmitry Melekhov wrote:
>
> ----- Original Message -----
> From: Mike Pogue <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, January 28, 2000 8:33 PM
> Subject: Re: xml encodings, java
>
> > The code you have below is a clever workaround, but ultimately, you
> > want to use a JVM that has the encoding support built-in.
> >
> > So, I'd suggest you try to use the IBM 1.1.8 JVM. It's fairly
> > reliable, scalable, and I think it has the encoding support you are
> > looking for. (Of course, I am biased in this! :-)
> >
>
> OK. I just tried the IBM JDK; it works exactly as Blackdown does in
> this case.
>
> But I want to know how Xerces (or maybe this is a Cocoon problem,
> I don't know) must work with encodings. Why is there the code which I
> commented out?
> Why not to work with xml content like with raw data, only processing
> tags?
> How must it work if I set the encoding in the XML document, and is it
> the input (i.e. what I have in the XML) or the output (i.e. what Cocoon
> sends to the browser) encoding, etc.? I want to understand how it
> works! :)
>
> Dmitry Melekhov
> http://www.aspec.ru/~dm
> 2:5050/[EMAIL PROTECTED]
>
> > Mike
> >
> >
> > Dmitry Melekhov wrote:
> > >
> > > Hello!
> > >
> > > I'm not sure that this list is the right place
> > > for this question. If I've made a mistake, I'm sorry!
> > >
> > > The question is Cocoon related and about how Xerces must
> > > work with encodings.
> > >
> > > I write my XML documents in koi8 encoding,
> > > but whether I set the encoding or not, I always see ???? in the
> > > browser instead of 8-bit characters.
> > > Taras Shumeyko pointed out to me that this is a formatter problem
> > > and that the problem is in
> > > org.apache.xml.serialize.BaseMarkupSerializer
> > > in the function protected String escape( String source )
> > >
> > > I changed it (removed all re-encodings from it) and now
> > > Cocoon and Xerces work OK.
> > > Here is my variant of the function:
> > >
> > > protected String escape( String source )
> > > {
> > >     StringBuffer result;
> > >     int i;
> > >     char ch;
> > >     String charRef;
> > >
> > >     result = new StringBuffer( source.length() );
> > >     for ( i = 0 ; i < source.length() ; ++i ) {
> > >         ch = source.charAt( i );
> > >         // If the character is not printable, print as character reference.
> > >         // Non printables are below ASCII space but not tab or line
> > >         // terminator, ASCII delete, or above a certain Unicode threshold.
> > >         // if ( ( ch < ' ' && ch != '\t' && ch != '\n' && ch != '\r' ) ||
> > >         //      ch > _lastPrintable || ch == 0xF7 )
> > >         //     result.append( "&#" ).append( Integer.toString( ch ) ).append( ';' );
> > >         // else {
> > >         //     If there is a suitable entity reference for this
> > >         //     character, print it. The list of available entity
> > >         //     references is almost but not identical between
> > >         //     XML and HTML.
> > >         //     charRef = getEntityRef( ch );
> > >         //     if ( charRef == null )
> > >         result.append( ch );
> > >         //     else
> > >         //         result.append( '&' ).append( charRef ).append( ';' );
> > >         // }
> > >     }
> > >     return result.toString();
> > > }
> > >
> > > But this is a dirty hack.
> > >
> > > I want to understand how Xerces must treat encodings and why
> > > it doesn't work now.
> > >
> > > --
> > > Dmitry Melekhov
> > > http://www.aspec.ru/~dm
> > > 2:5050/[EMAIL PROTECTED]
> > >
> > > P.S.
> > > My java platform is blackdown jdk 1.1.7 for Linux x86
> >
> >