[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627882#action_12627882 ]
Chris Harris commented on SOLR-284:
-----------------------------------

A couple of Tika things:

I glanced at Tika yesterday, and it looks like switching this patch over to it wouldn't be too hard. (The only thing half-worthy of note is that org.apache.tika.parser.Parser.parse outputs XHTML [via a SAX interface], which we would probably then need to turn into plaintext.) I haven't yet looked into Eric's code to see if it does anything special that Tika doesn't do.

I also noticed something else, though. Earlier comments say that Nutch uses Tika, but when I looked through Nutch trunk this seemed to be only partly the case. In particular, Nutch definitely uses the classes in the org.apache.tika.mime namespace to do things like auto-detect content types, but it doesn't seem to use the classes in org.apache.tika.parser to do the actual document parsing; instead, it uses its own separate org.apache.nutch.parse.Parser class (and subclasses thereof). For example, org.apache.nutch.parse.html.HtmlParser does not delegate to org.apache.tika.parser.html.HtmlParser but rather manipulates the tagsoup and/or nekohtml libraries directly. (Things are similar with the Nutch PDF parser.) Nor does there seem to be an alternative class along the lines of org.apache.nutch.parse.TikaBasedParserThatCanParseLotsOfDifferentContentTypesIncludingHtml. And the string "org.apache.tika.parser" doesn't seem to occur anywhere in the Nutch source.

I'm wondering if anyone knows why Nutch does not seem to make use of all of Tika's functionality. Are they planning to switch everything over to Tika eventually?
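
To make the XHTML-to-plaintext step mentioned above concrete, here is a rough sketch of a SAX handler that keeps only character data. The class name XhtmlTextCollector is hypothetical (it is not in the patch or in Tika), and the JDK's own SAX parser stands in for Tika's Parser.parse, on the assumption that Tika would feed the same kind of SAX events to whatever ContentHandler it is given. Tika's org.apache.tika.sax package also has a BodyContentHandler that may already cover this, so it would be worth checking before writing our own.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text content of an XHTML SAX event stream.
class XhtmlTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep character data; element names and attributes are simply ignored.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Crude separator: add a space after every closing tag so that text
        // from adjacent elements (e.g. <h1> followed by <p>) does not run together.
        text.append(' ');
    }

    String getText() {
        // Collapse runs of whitespace into single spaces.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    // Stand-in for the real pipeline: parses an XHTML string with the JDK SAX
    // parser. With Tika, the collector would instead be handed to the parser
    // as its ContentHandler (assumption, not verified against the patch).
    static String toPlainText(String xhtml) throws Exception {
        XhtmlTextCollector collector = new XhtmlTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Title</h1><p>Some <b>rich</b> text.</p></body></html>";
        System.out.println(toPlainText(xhtml));
    }
}
```

The space-after-every-closing-tag rule is deliberately simple; it can leave a stray space before punctuation that follows an inline element, so a real handler would probably want to treat block-level and inline elements differently.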

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into Solr.
> There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.