RE: Problem extracting Japanese characters in straight SAX parse

Bannister, Hilary J 26 Jun 2003 11:17:02 -0000

You need to use InputStreamReader and OutputStreamWriter and specify the
encoding to be "UTF8" when constructing the reader or writer.


For example for reading:

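   // Needs: java.io.*, org.xml.sax.InputSource
   StringBuffer buffer = new StringBuffer();
   try {
       FileInputStream fis = new FileInputStream("test.txt");
       // Decode the bytes explicitly as UTF-8, not the platform default
       InputStreamReader isr = new InputStreamReader(fis, "UTF8");
       Reader in = new BufferedReader(isr);
       int ch;
       while ((ch = in.read()) > -1) {
          buffer.append((char)ch);
       }
       in.close();
   } catch (IOException e) {
       e.printStackTrace();
       return null;
   }
   // parse(String) expects a system identifier (URI), not the XML text,
   // so wrap the decoded characters in an InputSource instead
   parser.parse(new InputSource(new StringReader(buffer.toString())));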

Beware when writing the strings out to files: this must go through an
OutputStreamWriter with an explicit encoding, or else you will lose the
encoding. You can then look at the files with Notepad or IE, which have
the relevant fonts loaded.

static void writeOutput(String str) {

    try {
        FileOutputStream fos = new FileOutputStream("test.txt");
        Writer out = new OutputStreamWriter(fos, "UTF8");
        out.write(str);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

We have just painfully solved this problem.

Regards
Hilary

-----Original Message-----
From: Hondros, Constantine [mailto:[EMAIL PROTECTED]
Sent: 26 June 2003 12:04
To: '[EMAIL PROTECTED]'
Subject: RE: Problem extracting Japanese characters in straight SAX
parse


Thanks for your help, 
I've done as you suggest, and now initiate the parse thus :

        java.io.FileInputStream f = new java.io.FileInputStream(tocFile);
        InputSource source = new InputSource(f);
        parser.parse(source);

However, when I receive attribute text via the SAX method
Attributes.getValue() in the startElement method, all the multi-byte UTF-8
characters have been converted to a single-byte, HEX 3F, the '?' character!

I'm obtaining the HEX dump by appending the returned Strings to a
StringBuffer and writing to a file. Unless I'm accidentally flattening out
Unicode characters at this point, SAX itself must be munging them, as far
as I can tell.
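
A quick way to check whether the flattening happens at write time is to
encode the same String with two different charsets and compare the bytes.
The sketch below is only an illustration: the class name EncodingLossDemo,
the helper hexOf and the sample string are made up, and StandardCharsets
needs a newer JDK than the one this thread dates from. A charset that
cannot represent a character substitutes its replacement byte, which for
ISO-8859-1 is 0x3F.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingLossDemo {
    // Encode text with the given charset and return the bytes as upper-case hex.
    static String hexOf(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two kanji, written as escapes so the source file encoding does not matter.
        String text = "\u65E5\u672C";
        // ISO-8859-1 has no mapping for these characters: each becomes 3F.
        System.out.println(hexOf(text, StandardCharsets.ISO_8859_1)); // 3F3F
        // UTF-8 keeps them, three bytes per character.
        System.out.println(hexOf(text, StandardCharsets.UTF_8));      // E697A5E69CAC
    }
}
```

If the dump shows 3F where the multi-byte sequences should be, the writer
(or a getBytes() call without an explicit charset) is a likely suspect.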

I don't get any errors when the input file is read, and I have confirmed
that its encoding is valid UTF-8 or UTF-16 (I've tried both encodings).

Is there anything else I can try?

Thanks ;-)

-----Original Message-----
From: Michael Glavassevich [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 25, 2003 7:07 PM
To: '[EMAIL PROTECTED]'
Subject: Re: Problem extracting Japanese characters in straight SAX
parse


Hello Constantine,

It looks like your problem is with FileReader. It assumes the default
character encoding for your system, which may be UTF-8, EBCDIC, or
something else. When you pass a Reader to the parser, any available
encoding information isn't used because the parser doesn't read from the
underlying byte stream. It only sees the transcoded characters.
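
To illustrate the transcoding problem, here is a minimal sketch that
decodes the same UTF-16 bytes through a Reader twice, once with the right
charset and once with a mismatched one. It uses InputStreamReader with an
explicit ISO-8859-1 charset as a stand-in for a FileReader whose platform
default does not match the file; the class name, the helper readAll and
the sample text are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetReaderDemo {
    // Decode all bytes through a Reader that uses the given charset.
    static String readAll(byte[] bytes, Charset charset) {
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int ch;
            while ((ch = in.read()) != -1) {
                sb.append((char) ch);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E";
        byte[] utf16 = original.getBytes(StandardCharsets.UTF_16);

        // Matching charset: the characters survive.
        System.out.println(readAll(utf16, StandardCharsets.UTF_16).equals(original));
        // Mismatched charset: every byte becomes its own character.
        System.out.println(readAll(utf16, StandardCharsets.ISO_8859_1).equals(original));
    }
}
```

By the time the parser sees the second result, the original bytes are
gone, so no encoding hint set afterwards can repair the characters.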

Unless you have a good reason against it, you should let the parser detect
the encoding itself. For instance you could create a FileInputStream
instead, and set this on your InputSource.
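
A self-contained sketch of that approach follows. It is an assumption-laden
illustration, not your actual setup: it uses the JDK's bundled JAXP parser
via SAXParserFactory, an in-memory byte array in place of a
FileInputStream, and a made-up element name; only the attribute name
"myattribute" comes from your code. The parser is given raw bytes and
works out the encoding from the byte order mark and the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ByteStreamParseDemo {
    // Parse raw XML bytes and collect the value of "myattribute" from each element.
    static String parseAttribute(byte[] xmlBytes) {
        final StringBuilder collected = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qname, Attributes attrs) {
                String value = attrs.getValue("myattribute");
                if (value != null) {
                    collected.append(value);
                }
            }
        };
        try {
            // A byte stream, not a Reader: encoding detection is left to the parser.
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return collected.toString();
    }

    public static void main(String[] args) {
        String expected = "\u65E5\u672C\u8A9E";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>"
                + "<toc myattribute=\"" + expected + "\"/>";
        // Java's UTF-16 encoder writes a byte order mark, as a UTF-16 file normally would.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16);
        System.out.println(parseAttribute(bytes).equals(expected));
    }
}
```

If the value comes back intact here, the String handed to startElement is
fine, and any damage is happening before the parse or when the result is
written out.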

Hope that helps.

On Wed, 25 Jun 2003, Hondros, Constantine wrote:

> I'm parsing a UTF-16 Japanese XML file with Xerces 2.4 with a simple class
> that extends DefaultHandler. I am just trying to write out certain CDATA
> attribute values (these are the Japanese characters)  into a file : very
> simple, supposedly.
>
> Problem is, there is some sort of encoding mischief going on , as the UTF-16
> Japanese characters in the CDATA attributes are coming out horribly mangled.
>
> This is how I am initiating the parse :
>
>       XMLReader parser =
> XMLReaderFactory.createXMLReader(DEFAULT_PARSER_NAME);
>       parser.setFeature(VALIDATION_FEATURE_ID, false);
>       parser.setContentHandler(this);
>       parser.setErrorHandler(this);
>       parser.setEntityResolver(new DTDResolver());
>       FileReader reader = new FileReader(tocFile);
>             InputSource source = new InputSource(reader);
>             source.setEncoding("UTF-16");
>             source.setSystemId(tocFile.getAbsolutePath());
>       parser.parse(source);
>
> and this (simplified) is how I am grabbing the Japanese characters (I am
> appending them to a StringBuffer) :
>
>       public void startElement(String uri, String local, String qname,
> Attributes attrs) throws SAXException {
>                   myStringBuffer.append(attrs.getValue("myattribute"));
>
> So two questions : should I be using a FileReader when I initiate the parse
> or some other object of the IO family?
> And : is it naive to expect the characters to pop off the attrs parameter
> without having to do some extra work?
>
> Any hints greatly appreciated,
>
> Constantine Hondros
>
>
>
>
> --
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or disclose
> it to anyone else. If you received it in error please notify us immediately
> and then destroy it.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

--------------------
Michael Glavassevich
[EMAIL PROTECTED]
