Re: Ignore invalid bytes

Simon Kitching 27 Jul 2004 02:39:20 -0000

On Tue, 2004-07-27 at 10:19, Andy Clark wrote:
> Harald Wehr wrote:
> > Is it possible to tell xerces just to ignore these bytes and to go on 
> > parsing the document?
> 
> You really shouldn't ignore this type of error. And even though
> Xerces has a continue-after-fatal-error setting, you are likely
> to get caught in an infinite loop if you use it in this situation.
> 
> > There is no need to display these documents 100 % correctly. A missing 
> > character is acceptable for us in this project rather than crashing the 
> > whole document with this exception.
> 
> Depending on the primary data in your document, a cheap trick is
> to use a Reader object with the input encoding set to ISO Latin 1
> because it uses the full eight bits in each byte and nothing is
> invalid. Of course, you should realize that every UTF-8 character
> after 127 will be corrupted using this trick.


Well, if you're using Java 1.4 or later, then presumably you could
implement your own character encoding scheme, register it, then tell the
parser to use that scheme. Your scheme would delegate to the UTF-8
decoder, except that on an invalid byte sequence it returns "?" or similar.

See java.nio.charset.Charset, java.nio.charset.CharsetDecoder and
java.nio.charset.spi.CharsetProvider.
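
For what it's worth, the "return ? on invalid input" part may not need a
custom charset at all: the standard UTF-8 CharsetDecoder can be told to
replace malformed input rather than fail. A minimal sketch, untested
against Xerces itself; the class and method names (LenientUtf8,
lenientDecoder, decode) are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of Xerces or the JDK.
class LenientUtf8 {

    // A UTF-8 decoder that substitutes "?" for malformed byte sequences
    // instead of reporting an error.
    static CharsetDecoder lenientDecoder() {
        return Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
    }

    // Decodes a whole byte array in one go.
    static String decode(byte[] bytes) throws CharacterCodingException {
        return lenientDecoder().decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = { 'a', (byte) 0xFF, 'b' };
        System.out.println(decode(bad));
    }
}
```

Registering such a decoder as a named charset via CharsetProvider would
still be the extra work described above; this only covers the decoding
behaviour.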

But this does seem to be a lot of work. Pre-processing the document to
remove the invalid characters is probably easier. Why not write your own
filter around wherever the xml is coming from, and replace the problem
chars before they get fed into the xml parser? I presume you're aware
that the parse methods take input streams as well as filenames; you just
need to make sure the stream is "sanitized".
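
One way such a "sanitizing" wrapper might look, as a rough sketch only:
it assumes the document really is meant to be UTF-8, goes through the
standard JAXP DocumentBuilder rather than any Xerces-specific API, and
uses made-up names (SanitizedParse, sanitize, parse). Instead of a
hand-written byte filter it hands the parser a Reader whose decoder
replaces bad bytes with "?":

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Xerces or the JDK.
class SanitizedParse {

    // Wraps the raw byte stream in a Reader that never fails on
    // malformed UTF-8; each bad sequence is decoded as "?".
    static Reader sanitize(InputStream in) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        return new InputStreamReader(in, decoder);
    }

    // The parser only ever sees characters, so it never gets the chance
    // to reject the invalid bytes.
    static Document parse(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(sanitize(in)));
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = { '<', 'r', '>', 'a', (byte) 0xFF, 'b', '<', '/', 'r', '>' };
        Document doc = parse(new ByteArrayInputStream(xml));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Note that when the parser is given a Reader, any encoding declaration in
the XML prolog is not used for decoding, so this only makes sense if all
the input is supposed to be UTF-8.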

Regards,

Simon



