[jira] Commented: (SOLR-284) Parsing Rich Document Types

Grant Ingersoll (JIRA) Wed, 07 May 2008 12:47:20 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595015#action_12595015
 ]


Grant Ingersoll commented on SOLR-284:
--------------------------------------




I think Tika will actually take less effort, since, as I understand  
it, you only need one interface.  We don't need separate handlers for  
each type; we just need to write the interface between Solr and Tika.
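
As a rough illustration of the single-interface idea (not code from the
patch on this issue): the sketch below assumes one hypothetical
ContentExtractor seam that Solr would call for every document type, with
the parsing library sitting behind it. Every name here (ExtractionSketch,
ContentExtractor, Extracted, PlainTextExtractor, toFields, the meta_
prefix) is invented for illustration, and a trivial plain-text stand-in
takes the place of Tika so the example stays self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractionSketch {

    /** Hypothetical result type: extracted body text plus document metadata. */
    static final class Extracted {
        final String text;
        final Map<String, String> metadata;

        Extracted(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Hypothetical single seam between the indexer and the parsing library. */
    interface ContentExtractor {
        Extracted extract(byte[] content, String contentType);
    }

    /** Stand-in for the real library: treats every input as UTF-8 text. */
    static final class PlainTextExtractor implements ContentExtractor {
        @Override
        public Extracted extract(byte[] content, String contentType) {
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("content_type", contentType);
            metadata.put("length", Integer.toString(content.length));
            return new Extracted(new String(content, StandardCharsets.UTF_8), metadata);
        }
    }

    /** Flattens text and metadata into one field map, as an indexer might. */
    static Map<String, String> toFields(String id, Extracted doc) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("text", doc.text);
        // Metadata entries become their own fields so they can be
        // sorted, filtered, and faceted on.
        for (Map.Entry<String, String> e : doc.metadata.entrySet()) {
            fields.put("meta_" + e.getKey(), e.getValue());
        }
        return fields;
    }

    public static void main(String[] args) {
        ContentExtractor extractor = new PlainTextExtractor();
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(toFields("doc1", extractor.extract(bytes, "text/plain")));
    }
}
```

The point of the design is that the caller only ever sees
ContentExtractor; swapping the stand-in for a library-backed
implementation would not change the calling code.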

Nutch is already using Tika.


+1


Yes, someone else maintains the code.  We just maintain the interface  
and upgrade when appropriate.


Well, metadata makes for nice fields to sort, filter, and facet on,  
right?


I think it is more likely that you will see Nutch integration w/ Solr  
(in fact, there is already a patch for it), but yeah, I think it makes  
sense to consider Solr as a sink for any crawler.

Some of this also overlaps with the Data Import Request Handler on  
SOLR-469.  I don't think we want to get Solr into the crawling game,  
but we also shouldn't prevent it from playing nicely with crawlers  
(not saying it doesn't already).


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler, based on the CSVRequestHandler, 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
