[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509683 ]
Eric Pugh commented on SOLR-284:
--------------------------------

So, I was not attempting to "boil the ocean" and provide the ultimate solution. Our need was just to take all the raw text and index it in a field, and pass in a bunch of other data fields to be indexed. We are parsing a large number of unstructured documents that may or may not have common fields populated, but fortunately we don't really need them. Our users aren't searching by author, but by content.

I think there are only 5 additional libraries, and one (poi-scratchpad) may be able to be removed... Yonik also mentioned using Tika as a framework for creating a common interface to these types of rich documents, but Tika is still in incubation and has no code in it!

I originally had separate handlers for each data type, and that was really icky, so I condensed them into the RichDocumentRequestHandler. I could also merge the CSVRequestHandler into it, by just taking out the logic for parsing CSV and putting it into a CSVParser. However, the CSVRequestHandler has very complex and rich semantics that these unstructured documents don't really need.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, test-files.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, PowerPoint, or Excel document into
> Solr.
> I am attaching a patch file with the code changes, and if this looks good,
> will add a page similar to http://wiki.apache.org/solr/UpdateCSV.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.