xml vs html parser

Allison, Timothy B. Tue, 16 Jun 2015 06:29:42 -0700

All,

  On govdocs1, the XML parser's exceptions accounted for nearly a quarter of 
all thrown exceptions at one point (Tika 1.7ish).  Typically, a file was 
misidentified as XML when in fact it was SGML or some other text-based file 
with some markup that wasn't meant to be XML.


  For kicks, I switched the config to use the HtmlParser for files identified 
as XML.  This got rid of the exceptions, but the extracted content was quite 
different (ballpark 6k files out of 35k files had similarity < 0.95), mostly 
because of elisions such as "the quick" -> "thequick".  I assume the words run 
together wherever they were separated only by tags...
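
  To illustrate the kind of elision I mean, here is a minimal sketch in Python. 
It uses the standard library's html.parser, not Tika, and the tag names and the 
extract_text helper are made up for the example.  It only shows how dropping 
tags without inserting whitespace glues adjacent words together; I have not 
confirmed that this is exactly what the HtmlParser does internally.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects character data; optionally emits a space at every tag boundary."""

    def __init__(self, space_at_tags):
        super().__init__()
        self.space_at_tags = space_at_tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.space_at_tags:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if self.space_at_tags:
            self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup, space_at_tags):
    """Hypothetical helper: strip tags and collapse runs of whitespace."""
    collector = TextCollector(space_at_tags)
    collector.feed(markup)
    collector.close()
    return " ".join("".join(collector.chunks).split())


# Made-up, non-HTML tag names, as one might see in SGML-like files.
sample = "<docno>the</docno><headline>quick</headline>"

print(extract_text(sample, space_at_tags=False))  # words separated only by tags are glued
print(extract_text(sample, space_at_tags=True))   # a space at each tag boundary keeps them apart
```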

  So, is there a way to make the XMLParser more lenient?  Or is there a way to 
configure the HtmlParser to add spaces for non-html tags?
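
  One shape a "more lenient" approach could take is to try strict XML parsing 
first and fall back to tolerant tag stripping (with spaces at tag boundaries) 
only when the strict parse fails.  The sketch below is Python standard library 
only, purely illustrative, and not written against Tika's API; the function 
and class names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class LenientText(HTMLParser):
    """Tolerant text collector that puts a space at every tag boundary."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.chunks.append(" ")

    def handle_endtag(self, tag):
        self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def extract_with_fallback(markup):
    """Return (mode, text): 'strict' if the input is well-formed XML, else 'lenient'."""
    try:
        root = ET.fromstring(markup)
        text = " ".join(root.itertext())
        mode = "strict"
    except ET.ParseError:
        # Not well-formed XML (e.g. SGML with unclosed tags): strip tags tolerantly.
        parser = LenientText()
        parser.feed(markup)
        parser.close()
        text = "".join(parser.chunks)
        mode = "lenient"
    # Collapse runs of whitespace so both paths produce comparable output.
    return mode, " ".join(text.split())


well_formed = "<doc><a>the</a><b>quick</b></doc>"
sgml_like = "<DOC><P>the<P>quick</DOC>"  # unclosed <P> tags: not well-formed XML

print(extract_with_fallback(well_formed))
print(extract_with_fallback(sgml_like))
```

  The fallback avoids the exception while keeping word boundaries, at the cost 
of parsing the failing files twice.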

  Or, is there a better solution?



     Thank you!

              Best,

                 Tim
