Re: character encodings with Xerces

Jonathan Whitall 15 May 2003 23:59:15 -0000

> If your application is reading the UTF-8 bytes coming
> from the database and wants to create, for example, DOM
> text nodes, then you need to convert the bytes into
> Java Strings to create the nodes. But this is easy in
> code.
> 
> Don't confuse the input/output encoding of a document
> with the encoding of the internal storage of those
> characters. Internally, Java stores everything as
> two-byte Unicode characters. Therefore, Xerces does NOT
> create nodes in UTF-8 or ISO Latin-1 byte sequences.
> 
> The parser only reads an XML document into an internal
> format (e.g. SAX or DOM). For writing the document back
> to a file (or stream), you would use a serializer with
> the intended output encoding. The Xerces package comes
> with serializers for this purpose.
> 
> Does this answer your question?
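
For reference, the byte-to-String conversion described in the quoted
text might look roughly like the sketch below. It uses the standard
JAXP/DOM interfaces rather than any Xerces-specific class. The class
name Utf8ToDom, the method names, and the element name "value" are
made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: names are illustrative, not part of Xerces.
public class Utf8ToDom {

    // Decode raw UTF-8 bytes (e.g. read from a database column)
    // into a Java String before handing them to the DOM.
    static String decodeUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Build a one-element document whose text node holds the decoded String.
    static Document buildDocument(byte[] utf8Bytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("value"); // assumed element name
            root.appendChild(doc.createTextNode(decodeUtf8(utf8Bytes)));
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xC5 0x9E is the UTF-8 encoding of U+015E (the character behind &#350;).
        byte[] fromDatabase = {(byte) 0xC5, (byte) 0x9E};
        Document doc = buildDocument(fromDatabase);
        String text = doc.getDocumentElement().getTextContent();
        System.out.println(text.length()); // two bytes decode to one char
    }
}
```

The DOM only ever sees Strings here. The output encoding comes into
play later, when the document is serialized.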


Hi,

Yes, I am using DOM.  I did play around with
XMLSerializer and was able to set the outbound
encoding to Latin-1 without any problems.  The
characters in question that weren't in the bounds of
my outbound encoding got converted to entity
representation (e.g. &#350;).  This is certainly
better than sending the actual Unicode character, but
what I really want to do is filter out all of these
characters that don't fall within the bounds of
Latin-1.  Is there a way to scan and inspect all of
the entities in a particular document, or to
automatically filter them out on serialization?

Thanks,
Jonathan
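
One possible approach to the filtering question above, sketched under
assumptions rather than as a built-in Xerces feature: walk the DOM
before serializing and drop every character above U+00FF from text,
CDATA, comment, and attribute values, so the serializer has nothing
left to escape. The class name Latin1Filter and its methods are
hypothetical. Element and attribute names are not touched.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper: names are illustrative, not part of Xerces.
public class Latin1Filter {

    // Keep only characters representable in ISO-8859-1 (U+0000..U+00FF).
    // Surrogate halves are above 0xFF, so supplementary characters are dropped too.
    static String stripNonLatin1(String s) {
        StringBuilder kept = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0xFF) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    // Recursively filter character data and attribute values in place.
    static void filterTree(Node node) {
        if (node instanceof CharacterData) { // Text, CDATASection, Comment
            CharacterData data = (CharacterData) node;
            data.setData(stripNonLatin1(data.getData()));
        }
        NamedNodeMap attrs = node.getAttributes(); // null unless node is an Element
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                attr.setNodeValue(stripNonLatin1(attr.getNodeValue()));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            filterTree(child);
        }
    }

    // Small demonstration: build a document, filter it, return the root's text.
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("name"); // assumed element name
            root.appendChild(doc.createTextNode("A\u015Eb")); // U+015E is outside Latin-1
            doc.appendChild(root);
            filterTree(doc);
            return root.getTextContent();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "Ab"
    }
}
```

Calling filterTree(doc) just before serializing to Latin-1 would
remove the characters that otherwise show up as &#350;-style
references. Swapping the drop for a replacement character or a log
statement in stripNonLatin1 would cover the "scan and inspect" case.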

