[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646947#action_12646947 ]
Grant Ingersoll commented on SOLR-284:
--------------------------------------

Some initial thoughts on moving forward:

I think we can add some generic functionality here via the request params:

1. Tika can provide a lot of metadata about a document. By metadata, I mean things like the actual author, page count, etc. as provided by the document itself, not the hardcoded metadata described at http://wiki.apache.org/solr/UpdateRichDocuments. The hardcoded metadata is also useful and should be retained. Given both, we need a way to map fields from Tika's metadata to Solr fields. If no mapping is specified, the handler tries to use the Tika metadata name as the field name. If no such field exists, we can rely on dynamic fields, or we can allow a param that passes in the name of a default field to map to.

2. We can auto-detect the MIME type or allow it to be passed in. Thus, stream.type becomes optional, but is still useful.

3. Tika provides a mechanism for implementing your own SAX ContentHandler and passing that in. I will likely make this pluggable so that people can provide their own. I _think_ this would allow people to make even further refinements to the content (e.g. splitting on paragraphs or other things like that?).

I should have a start of a patch today or tomorrow.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch,
> rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip,
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into
> Solr.
> There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
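
For illustration only, the field-mapping fallback proposed in point 1 of the comment (explicit mapping, then the Tika name if the schema has such a field, then dynamic fields, then a default field) could be sketched roughly as follows. This is a minimal sketch in Python for brevity, not the handler's actual code. The function names, the `mapping` / `schema_fields` / `dynamic_patterns` / `default_field` parameters, and the glob-style matching of dynamic field patterns are all assumptions made for the example.

```python
import fnmatch


def resolve_field(tika_name, mapping, schema_fields,
                  dynamic_patterns=(), default_field=None):
    """Pick a Solr field for one Tika metadata name (hypothetical helper).

    Returns None when nothing applies, meaning the value is skipped.
    """
    # 1. An explicit mapping supplied with the request wins.
    if tika_name in mapping:
        return mapping[tika_name]
    # 2. Otherwise use the Tika name itself if the schema declares it.
    if tika_name in schema_fields:
        return tika_name
    # 3. Otherwise rely on a dynamic field pattern such as "*_s".
    for pattern in dynamic_patterns:
        if fnmatch.fnmatchcase(tika_name, pattern):
            return tika_name
    # 4. Finally fall back to the caller-supplied default field, if any.
    return default_field


def map_metadata(metadata, mapping, schema_fields,
                 dynamic_patterns=(), default_field=None):
    """Build a {solr_field: [values]} dict from Tika-style metadata."""
    doc = {}
    for name, value in metadata.items():
        field = resolve_field(name, mapping, schema_fields,
                              dynamic_patterns, default_field)
        if field is None:
            continue  # no mapping, no matching field, no default: drop it
        doc.setdefault(field, []).append(value)
    return doc
```

With `mapping={"Author": "author"}`, a schema containing `author` and `title`, the dynamic pattern `*_s`, and `text` as the default field, the names `Author`, `title`, `custom_s`, and `pages` would land in `author`, `title`, `custom_s`, and `text` respectively. Whether unmapped values should be dropped or sent to a default field is exactly the open design question raised in the comment.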