[ https://issues.apache.org/jira/browse/TIKA-113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569683#action_12569683 ]
Jukka Zitting commented on TIKA-113:
------------------------------------

A solution based on the current code is:

    Writer writer = ...;
    XPathParser xpath = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
    ContentHandler handler = new MatchingContentHandler(
            new WriteOutContentHandler(writer),
            xpath.parse("/xhtml:html/xhtml:body//*"));

I'm not sure whether we should codify that into a helper class or a method.

> Metadata (such as title) should not be part of content
> ------------------------------------------------------
>
>                 Key: TIKA-113
>                 URL: https://issues.apache.org/jira/browse/TIKA-113
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Rida Benjelloun
>             Fix For: 0.2-incubating
>
>
> Metadata (such as title) is added in the content. In my opinion it would be
> preferable that the toString() on the writer return only the content of the
> document and not metadata. The metadata are already stored in the metadata
> object
> Rida.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
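
For readers without the Tika classes at hand, the idea behind the comment's snippet (forward only the events under /xhtml:html/xhtml:body to the writer, so that the title in the head never reaches the output) can be sketched with plain JDK SAX. This is not Tika's implementation: BodyTextExtractor, BodyOnlyHandler and extract are hypothetical names chosen for illustration, and the sketch simply forwards all character data inside the body rather than evaluating an XPath expression.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: keeps body text, drops head content.
class BodyTextExtractor {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Writes character data to the writer only while inside html/body. */
    static class BodyOnlyHandler extends DefaultHandler {
        private final Writer writer;
        private int depth = 0;              // depth of the current element
        private boolean rootIsHtml = false; // true if the root is xhtml:html
        private boolean inBody = false;     // true between <body> and </body>

        BodyOnlyHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            depth++;
            if (depth == 1) {
                rootIsHtml = XHTML.equals(uri) && "html".equals(local);
            } else if (depth == 2 && rootIsHtml
                    && XHTML.equals(uri) && "body".equals(local)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (depth == 2) {
                inBody = false; // leaving a direct child of the root
            }
            depth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (inBody) {
                try {
                    writer.write(ch, start, length);
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        }
    }

    /** Parses an XHTML string and returns only the text found inside the body. */
    static String extract(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to see namespace URIs and local names
        StringWriter out = new StringWriter();
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xhtml)), new BodyOnlyHandler(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>My Title</title></head>"
                + "<body><p>Hello world</p></body></html>";
        System.out.println(extract(doc)); // prints "Hello world"
    }
}
```

Whether such filtering belongs in a helper class or a method is the question the comment leaves open; the sketch only shows that the title can stay in the metadata while the writer receives body text alone.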