[ https://issues.apache.org/jira/browse/TIKA-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-377.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7
         Assignee: Jukka Zitting

In cases like this the Tika type detection code is fooled into thinking that the document is XML, and obviously any draconian XML parser will reject such documents.

In revisions 908554 and 908560 I added some more heuristics to Tika for better detecting such tag soup HTML. With these changes the attached test document is correctly recognized as HTML and parsed with the lenient HTML parser.

> Error parsing HTML partial with AutoDetect parser
> -------------------------------------------------
>
>                 Key: TIKA-377
>                 URL: https://issues.apache.org/jira/browse/TIKA-377
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.6
>            Reporter: Brett S.
>            Assignee: Jukka Zitting
>             Fix For: 0.7
>
>         Attachments: test.html
>
>
> I get the following error when parsing an HTML file containing a partial HTML
> document.
> TIKA-237: Illegal SAXException from
> org.apache.tika.parser.xml.dcxmlpar...@3a43af
> The following conditions need to exist in the file for the error to be thrown:
> + An HTML comment before any HTML tags
> + More than one top level HTML tag
> I will attach a test file to reproduce

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
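
For readers who want to see why a strict XML parser rejects this kind of input, here is a minimal sketch using only the JDK's built-in XML parser. It does not use Tika and is not the heuristic added in the revisions mentioned above. The class name `TagSoupDemo`, the helper `isWellFormedXml`, and the sample fragment are hypothetical; the fragment only mimics the two conditions from the report (a comment before any tags, and more than one top-level tag) and is not the attached test.html.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class TagSoupDemo {
    // Hypothetical HTML partial: a comment first, then two top-level elements.
    static final String FRAGMENT =
        "<!-- partial page -->\n<div>first</div>\n<div>second</div>\n";

    // The same content wrapped in a single root element, for comparison.
    static final String WRAPPED = "<root>" + FRAGMENT + "</root>";

    // Returns true only if the JDK's strict XML parser accepts the text.
    static boolean isWellFormedXml(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow errors instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignore warnings */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Not well-formed XML, e.g. markup after the root element has closed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("fragment well-formed: " + isWellFormedXml(FRAGMENT));
        System.out.println("wrapped well-formed:  " + isWellFormedXml(WRAPPED));
    }
}
```

The fragment fails because XML allows exactly one root element, so the second top-level tag is a fatal error for a strict parser, whereas a lenient HTML parser tolerates it. That is why routing such documents to the HTML parser, as the fix does, avoids the exception.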