Re: character encodings with Xerces

Andy Clark 15 May 2003 19:11:42 -0000

Jonathan Whitall wrote:

I was wondering if Xerces can convert from one text
encoding to another specified one on the fly.  I have
some data that is stored in UTF-8 in a database, and I
want to be able to create text nodes which are in the
set of Latin-1.  If I pass UTF-8 to, say, the creator
of a text node, can it convert this automatically, or
do I have to lop off the bytes that I don't want
manually?


If your application reads the UTF-8 bytes coming
from the database and wants to create, for example, DOM
text nodes, then you need to decode the bytes into
Java Strings before creating the nodes. This is easy
to do in code.
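
As a rough sketch of that conversion (not Xerces-specific; it
uses only the standard JAXP and DOM interfaces, and the class
name, method name, and "value" element name are made up for
illustration), you could decode the bytes and build the node
like this, assuming a Java version that has StandardCharsets:

```java
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class Utf8ToTextNode {
    // Decodes UTF-8 bytes into a String and wraps it in a DOM text node.
    // The element name "value" is a placeholder, not anything Xerces requires.
    static Document buildDocument(byte[] utf8Bytes) throws Exception {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("value");
        root.appendChild(doc.createTextNode(text));
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // "café" as UTF-8: the é takes two bytes (0xC3 0xA9).
        byte[] fromDatabase = {'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9};
        Document doc = buildDocument(fromDatabase);
        // The node holds four characters, not five bytes.
        System.out.println(doc.getDocumentElement().getTextContent().length());
    }
}
```

Note that nothing here says "Latin-1": the node just holds
characters, and no bytes need to be lopped off.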

Don't confuse the input/output encoding of a document
with the internal storage of its characters.
Internally, Java stores all text as 16-bit Unicode
characters. Therefore, Xerces does NOT create nodes
in UTF-8 or ISO Latin-1 byte sequences.
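
To make the distinction concrete, here is a small illustrative
sketch (the class and method names are hypothetical). The same
String produces a different number of bytes depending on the
encoding chosen at output time, and you can ask a charset
whether it can represent a given string at all:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class InternalVersusExternal {
    // Number of bytes the string occupies once encoded for output.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // True if every character of the string exists in ISO Latin-1.
    static boolean fitsInLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        System.out.println(text.length());                                    // 4 characters
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));      // 5 bytes
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // 4 bytes
        System.out.println(fitsInLatin1(text));     // true
        System.out.println(fitsInLatin1("\u20ac")); // false: the euro sign is not in Latin-1
    }
}
```

A check like fitsInLatin1 is one way to find out in advance
whether your data really is within the Latin-1 set.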

The parser only reads an XML document into an internal
representation (e.g. SAX events or a DOM tree). To write
the document back to a file (or stream), you would use a
serializer with the intended output encoding. The Xerces
package comes with serializers for this purpose.
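
As one possible sketch of the output side, the following uses
the standard JAXP identity Transformer rather than the
Xerces-specific serializer classes (I am substituting it here
only because its API is part of the JDK). The class and method
names and the "value" element are invented for the example:

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SerializeAsLatin1 {
    // Builds a one-element document holding the given text.
    static Document sampleDocument(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("value");
        root.appendChild(doc.createTextNode(text));
        doc.appendChild(root);
        return doc;
    }

    // Writes the DOM tree as bytes in the requested output encoding.
    static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = serialize(sampleDocument("caf\u00e9"), "ISO-8859-1");
        byte[] utf8 = serialize(sampleDocument("caf\u00e9"), "UTF-8");
        // Same tree, different byte sequences on output.
        System.out.println(latin1.length + " bytes vs " + utf8.length + " bytes");
    }
}
```

Characters that the chosen output encoding cannot represent
are typically written as numeric character references by the
serializer, so the choice of encoding is made only here, at
the point of output.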

Does this answer your question?

--
Andy Clark * [EMAIL PROTECTED]


