[ https://issues.apache.org/jira/browse/TIKA-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763783#action_12763783 ]
Benson Margulies commented on TIKA-303:
---------------------------------------

Here's a patch:

diff -r apache-tika-0.4/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java apache-tika-0.4-mod/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
101a102,103
> private boolean lazyStarted;
>
115a118
> started = true;
140a144
> lazyStarted = true;
155,157c159,163
< endElement("body");
< endElement("html");
< endPrefixMapping("");
---
> if (lazyStarted) {
>     endElement("body");
>     endElement("html");
>     endPrefixMapping("");
> }

Yes it's contributed to the ASF. I'm a member.

> XHTMLContentHandler mishandles headers
> --------------------------------------
>
>                 Key: TIKA-303
>                 URL: https://issues.apache.org/jira/browse/TIKA-303
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Benson Margulies
>
> XHTMLContentHandler.startDocument does not note that it has been called. So
> then lazyStartDocument will happen and embed an extra layer of
> head/title/body processing.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.