All,
On govdocs1, the XML parser's exceptions accounted for nearly a quarter of
all thrown exceptions at one point (Tika 1.7ish). Typically, a file was
misidentified as XML when in fact it was SGML or some other text-based file
with some markup that wasn't meant to be XML.
For kicks, I switched the config to use the HtmlParser for files identified
as XML. This got rid of the exceptions, but the content was quite different
(ballpark 6k files out of 35k files had similarity < 0.95), mostly because of
elided whitespace, e.g. "the quick" -> "thequick". I assume this happens
across tag boundaries...
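To illustrate the kind of elision I mean, here is a toy sketch. It is not Tika's HtmlParser code, just a regex-based stand-in; the class name TagElisionDemo and the method stripTags are made up for illustration. It shows how dropping tags without emitting any whitespace glues adjacent words together, while emitting a space per tag keeps them apart:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Tika.
public class TagElisionDemo {
    // Matches anything that looks like a tag, e.g. <DOC>, </TITLE>, <br/>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Replace every tag with the given filler, then collapse runs of
    // whitespace into single spaces and trim the ends.
    public static String stripTags(String markup, String filler) {
        return TAG.matcher(markup)
                  .replaceAll(filler)
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    public static void main(String[] args) {
        // SGML-ish input with non-HTML tags and no whitespace between elements.
        String sgml = "<DOC><TITLE>the</TITLE><AUTHOR>quick</AUTHOR></DOC>";
        System.out.println(stripTags(sgml, ""));   // prints "thequick"
        System.out.println(stripTags(sgml, " "));  // prints "the quick"
    }
}
```

If something like the second behavior could be switched on for unknown tags, I would expect most of the similarity differences to go away, though that is a guess on my part.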
So, is there a way to make the XMLParser more lenient? Or is there a way to
configure the HtmlParser to add spaces for non-HTML tags?
Or, is there a better solution?
Thank you!
Best,
Tim