Title: RE: JAVA: trouble with UTF8 encoding and org.w3c.dom.CharacterData.getData()

If you use a Reader, you are converting the bytes to characters yourself, before the parser ever sees them. If you do not specify the encoding, the platform default encoding is used, which might not be UTF-8, and that would explain the behavior you're seeing. You are probably better off passing a FileInputStream to the InputSource and not worrying about encodings. The parser will auto-detect the encoding and convert the bytes to the right characters. Alternatively, you can just say

parser.parse(fileURI), where 'fileURI' is the URI representation of the file path.
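
A minimal sketch of both suggestions. It assumes the standard JAXP DocumentBuilder in place of the Xerces DOMParser class from the original post (to my knowledge DOMParser accepts the same two kinds of argument). The class name, the helper names, and the three sample characters are made up for illustration, since the original characters did not survive the mail.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class ParseFromBytes {

    // Hand the parser raw bytes; it picks the encoding from the XML declaration.
    static String textViaStream(File file) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new FileInputStream(file)) {
            Document doc = builder.parse(new InputSource(in));
            return firstText(doc);
        }
    }

    // Hand the parser a URI string; it opens the byte stream itself.
    static String textViaUri(File file) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(file.toURI().toString());
        return firstText(doc);
    }

    // Return the data of the Text node inside the first <text> element.
    static String firstText(Document doc) {
        Text text = (Text) doc.getElementsByTagName("text").item(0).getFirstChild();
        return text.getData();
    }

    // Write a stand-in for the file from the original post (hypothetical content).
    static File writeSample() throws Exception {
        File file = File.createTempFile("sample", ".xml");
        file.deleteOnExit();
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<myxml><text>\u65e5\u672c\u8a9e</text></myxml>";
        Files.write(file.toPath(), xml.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws Exception {
        File file = writeSample();
        System.out.println(textViaStream(file).length()); // 3
        System.out.println(textViaUri(file).length());    // 3
    }
}
```

In both variants no Reader is involved, so the platform default encoding never comes into play.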

Pradeep

-----Original Message-----
From: SAXESS - Hussayn Dabbous
To: [EMAIL PROTECTED]
Sent: 9/15/2002 1:42 PM
Subject: JAVA: trouble with UTF8 encoding and org.w3c.dom.CharacterData.getData()

Hi, Java programmers,

I want to read UTF-8 characters from an XML file using a DOMParser, but
all I get is a sequence of single bytes. This is probably a beginner's
error, but I can't see what I am doing wrong. Maybe someone can help me?

I did the following:

1.) I have written a simple XML file containing UTF-8 encoded
characters:

    +++ begin of file +++++++++++++++++++++++++++++++++++++++++++
    <?xml version="1.0" encoding="UTF-8"?>
    <myxml w="150" h="200" color="FFCCDDEE">
      <text font="Cyberbit Cyberspace" size="13">???</text>
    </myxml>
    +++ end of file +++++++++++++++++++++++++++++++++++++++++++++

    The three characters enclosed in the <text> tag are in fact three
    UTF-8 characters. When looking at the file with XML Spy, I can see
    the three characters. When looking at the file with a Unix text
    editor, I see 9 bytes in total there, which I have verified to be
    the correct UTF-8 encoding. This mail possibly contains only three
    question marks ("???").

2.) I read the file using a DOMParser as follows:

    * I create a DOMParser() instance
    * I create an InputSource(FileReader) instance
    * I create a Document with DOMParser.parse(InputSource)
    * Then I step through the resulting Document instance,
      retrieve the Elements, detect the Text nodes, and finally
      call Text.getData() to retrieve the text string.

3.) Now I expect the text string to contain 3 characters, each of them
    a Unicode character. But all I get is 9 characters, each containing
    one byte of the raw UTF-8 string.
 
I tried encoding="UTF8", but that didn't help.
What's going wrong?

Maybe I should use an InputStream(filename,"UTF-8") instead of a
FileReader instance? (That doesn't sound correct to me.)
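
For reference, java.io has no InputStream(filename, "UTF-8") constructor; the nearest standard class is an InputStreamReader wrapped around a FileInputStream. The sketch below is an illustration only: it uses the JAXP DocumentBuilder instead of DOMParser, made-up class and helper names, and three stand-in characters. It decodes the same file once as ISO-8859-1 (standing in for a non-UTF-8 platform default) and once as UTF-8. When the InputSource carries a Reader, the parser works on the characters it is given and the encoding declaration in the file has no effect.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class ReaderEncodingDemo {

    // Decode the file with an explicit charset, then parse the resulting characters.
    static String textViaReader(File file, String charsetName) throws Exception {
        try (Reader reader = new InputStreamReader(new FileInputStream(file), charsetName)) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
            Text text = (Text) doc.getElementsByTagName("text").item(0).getFirstChild();
            return text.getData();
        }
    }

    // Write a stand-in for the file from the post: three characters, 9 bytes in UTF-8.
    static File writeSample() throws Exception {
        File file = File.createTempFile("sample", ".xml");
        file.deleteOnExit();
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<myxml><text>\u65e5\u672c\u8a9e</text></myxml>";
        Files.write(file.toPath(), xml.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws Exception {
        File file = writeSample();
        // Wrong charset: every byte becomes its own character.
        System.out.println(textViaReader(file, "ISO-8859-1").length()); // 9
        // Matching charset: the 9 bytes decode to 3 characters.
        System.out.println(textViaReader(file, "UTF-8").length());      // 3
    }
}
```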


Any hint would help.
Regards, Hussayn

--
Dr. Hussayn Dabbous
SAXESS Software Design GmbH
Neuenhöfer Allee 125
50935 Köln
Telefon: +49-221-56011-0
Fax:     +49-221-56011-20
E-Mail:  [EMAIL PROTECTED]

