[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595015#action_12595015 ]
Grant Ingersoll commented on SOLR-284:
--------------------------------------

I think Tika will actually take less effort, as you only need one interface, as I understand it. You don't need separate handlers for each type, we just need to write the interface between Solr and Tika.

Nutch is already using Tika.

+1

Yes, someone else maintains the code. We just maintain the interface and upgrade when appropriate.

well, metadata makes for nice fields to sort, filter and facet on, right?

I think it is more likely that you will see Nutch integration w/ Solr (in fact, there is already a patch for it), but yeah, I think it makes sense to consider Solr as a sink for any crawler.

Some of this also overlaps w/ the Data Import Request Handler on SOLR-469.

I don't think we want to get Solr into the crawling game, but we also shouldn't prevent it from playing nicely with crawlers (not saying it doesn't already)

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch,
>                      source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into
> Solr.
> There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments
>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.