Re: Surviving a UTF-8 parsing error.

Dan B. Wed, 31 Oct 2012 09:43:00 -0700

Andy Seaborne wrote:

On 29/10/12 15:56, Olivier Rossel wrote:

Sometimes my CONSTRUCTs retrieve wrongly encoded UTF-8 content.
When Xerces/Jena parses such XML data, it returns an XML parsing error.


Is it a common issue?
Could we imagine a workaround, so the parsing does not fail on UTF-8
encoding errors?

Maybe pre-parse and fix any UTF-8 inconsistencies before the XML parsing...


The conversion from bytes to chars is done inside Xerces and is not recoverable.

Testing first is better - there is a command riotcmd.utf8 that checks a file.

(The non-RDF/XML parsers use Java's own conversion, but the issue remains:
it's not recoverable, albeit for a different reason. The standard decoders
buffer and don't say where the encoding problem was.)
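
As a rough illustration of the "test first" idea, here is a minimal Java sketch that checks a byte array before it is handed to any parser. It is not what riotcmd.utf8 or Jena does internally; the class name Utf8Check and the method firstMalformedOffset are made up for this example. It assumes the whole input fits in memory and drives a CharsetDecoder directly on a ByteBuffer, which (unlike a buffering reader) leaves the buffer position at the start of the malformed sequence.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Jena or Xerces.
final class Utf8Check {

    /**
     * Returns the byte offset of the first malformed UTF-8 sequence,
     * or -1 if the whole array decodes cleanly.
     */
    static int firstMalformedOffset(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024);

        while (true) {
            // endOfInput = true: a truncated trailing sequence counts as malformed.
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The malformed bytes begin at the input buffer's position.
                return in.position();
            }
            if (result.isUnderflow()) {
                // All input was consumed without error.
                return -1;
            }
            // Overflow: the decoded characters are not needed, so discard them.
            out.clear();
        }
    }

    public static void main(String[] args) {
        // 'a', 'b', then 0xC3 followed by a byte that is not a continuation byte.
        byte[] data = {0x61, 0x62, (byte) 0xC3, 0x28};
        System.out.println("first malformed offset: " + firstMalformedOffset(data));
    }
}
```

Knowing the offset only tells you where to look; whether the bad bytes can be repaired depends on knowing what encoding the data was really in.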


Also, note that rejecting invalid UTF-8 sequences is recommended
(required, actually, in some specifications) for security.

(You don't want an input validator and a later input-processing step
interpreting invalid UTF-8 byte sequences differently, so the usual
rule is that an invalid UTF-8 byte sequence must result in an error.)


Daniel
