On 29/10/12 15:56, Olivier Rossel wrote:
Sometimes my CONSTRUCTs retrieve wrongly encoded UTF-8 content.
When Xerces/Jena parses such XML data, it returns an XML parsing error.
Is it a common issue?
Could we imagine a workaround, so the parsing does not fail on UTF-8
encoding errors?
Maybe pre-parse and fix any UTF-8 inconsistencies before the XML parsing...
The conversion from bytes to chars is done inside Xerces and is not
recoverable.
Testing first is better - there is a command riotcmd.utf8 that checks a
file.
(The non-RDF/XML parsers use Java's own conversion, but the issue remains:
it's not recoverable, because the standard decoders buffer and
don't say where the encoding problem was.)
Andy
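
For anyone who wants to do the "test first" step from their own code, here is a minimal sketch using only the JDK's java.nio.charset API. It is not what riotcmd.utf8 or the Jena parsers do internally; the class and method names (Utf8Check, firstInvalidOffset, replaceInvalid) are made up for illustration. Driving a CharsetDecoder by hand in REPORT mode gives you the byte offset of the first bad sequence, which is the information the buffering decoders lose. The second method is a lossy repair: it swaps each malformed sequence for U+FFFD so the parser no longer fails, but the original characters are gone.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jena or Xerces.
final class Utf8Check {

    // Returns the byte offset of the first malformed UTF-8 sequence,
    // or -1 if the whole array decodes cleanly.
    static int firstInvalidOffset(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024); // scratch space, contents discarded
        while (true) {
            // endOfInput = true: a truncated trailing sequence counts as malformed.
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error the input position is at the start of the bad bytes.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: reuse the scratch buffer and keep going
        }
    }

    // Lossy repair: String(byte[], Charset) replaces every malformed
    // sequence with U+FFFD, so the re-encoded bytes are valid UTF-8.
    static byte[] replaceInvalid(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }
}
```

Typical use would be to read the CONSTRUCT response into a byte array, call firstInvalidOffset on it, and only hand the bytes to the parser if the result is -1. Otherwise report the offset, or parse the output of replaceInvalid if losing the damaged characters is acceptable.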