[jira] Commented: (SOLR-284) Parsing Rich Document Types

Chris Harris (JIRA) Tue, 02 Sep 2008 18:08:09 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627882#action_12627882
 ]


Chris Harris commented on SOLR-284:
-----------------------------------

A couple of Tika things:

I glanced at Tika yesterday, and it looks like switching this patch over to it 
wouldn't be too hard. (The only thing half-worthy of note is that 
org.apache.tika.parser.Parser.parse outputs XHTML [via a SAX interface], which 
we would probably then need to turn into plaintext.) I haven't yet looked into 
Eric's code to see if it does anything special that Tika doesn't do.
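
A minimal sketch of the XHTML-to-plaintext step mentioned above, using only the JDK's SAX classes. The class names XhtmlToPlainText and TextCollector are hypothetical, and the set of elements treated as line breaks is an assumption. The JDK's own XML parser stands in for Tika here, so this shows only the handler side; whether the same handler could be handed to org.apache.tika.parser.Parser.parse unchanged has not been checked.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlToPlainText {

    /** Hypothetical handler: keeps character data and drops all markup. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text between tags arrives here, possibly in several chunks.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Assumption: these block-level elements should end a line,
            // so that adjacent paragraphs do not run together.
            if (qName.equals("p") || qName.equals("div") || qName.equals("title")) {
                text.append('\n');
            }
        }

        String getText() {
            return text.toString().trim();
        }
    }

    /** Parses an XHTML string with the JDK SAX parser and returns its text. */
    static String toPlainText(String xhtml) throws Exception {
        TextCollector collector = new TextCollector();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return collector.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello</p><p>rich <b>document</b></p></body></html>";
        System.out.println(toPlainText(xhtml));
    }
}
```

The handler ignores startElement entirely, so inline markup such as <b> disappears without adding whitespace, while the chosen block elements each contribute a newline.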

I also noticed something else, though. Earlier comments say that Nutch uses 
Tika, but when I looked through Nutch trunk this seemed to only sort of be the 
case. In particular, Nutch definitely uses the stuff in the 
org.apache.tika.mime namespace, to do things like auto-detect content types, 
but it doesn't seem to use the stuff in org.apache.tika.parser to do the actual 
document parsing; instead, it uses its own separate 
org.apache.nutch.parse.Parser class (and subclasses thereof). For example, 
org.apache.nutch.parse.html.HtmlParser does not delegate to 
org.apache.tika.parser.html.HtmlParser but rather does its own direct 
manipulation of the tagsoup and/or nekohtml libraries. (Things are similar with 
the Nutch PDF parser.) Nor does there seem to be an alternative class along the 
lines of 
org.apache.nutch.parse.TikaBasedParserThatCanParseLotsOfDifferentContentTypesIncludingHtml.
And the string "org.apache.tika.parser" doesn't seem to occur in the Nutch 
source.

I'm wondering if anyone knows why Nutch does not seem to make use of all of 
Tika's functionality. Are they planning to switch everything over to Tika 
eventually?


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, 
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
