[jira] Commented: (SOLR-284) Parsing Rich Document Types

Grant Ingersoll (JIRA) Wed, 12 Nov 2008 08:38:07 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646947#action_12646947
 ]


Grant Ingersoll commented on SOLR-284:
--------------------------------------

Some initial thoughts on moving forward:

I think we can add some generic functionality here via the request params:

1. Tika can provide a lot of metadata about a document.  By metadata, I mean 
things like the actual author, page count, etc. as provided by the document 
itself, not the hardcoded metadata described at 
http://wiki.apache.org/solr/UpdateRichDocuments.  The hardcoded metadata is 
also useful and should be retained.  Given both, we need a way to map fields 
from Tika's metadata to Solr fields.  If no mapping is specified for a 
metadata name, we try to use the Tika metadata name as the Solr field name.  
If no such field exists, we can rely on dynamic fields, or we can allow a 
param that passes in the name of a default field to map to.

2. We can auto-detect the mime type or allow it to be passed in.  Thus, 
stream.type becomes optional, but is still useful.

3. Tika provides a mechanism for implementing your own SAX ContentHandler and 
passing that in.  I will likely make this pluggable so that people can 
provide their own.  I _think_ this would allow people to make even further 
refinements to the content (e.g. splitting on paragraphs or other things like 
that).
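
To make the fallback order in point 1 concrete, here is a minimal sketch in 
plain Java.  It is only an illustration of the idea, not the patch: the class 
name MetadataFieldResolver, the metadata names "author" and "pages", and the 
Solr field names are all hypothetical, and the schema lookup is reduced to a 
simple set of known field names (dynamic-field pattern matching is left out).

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; not part of Solr or Tika. */
public class MetadataFieldResolver {
    private final Map<String, String> explicitMappings; // metadata name -> Solr field
    private final Set<String> schemaFields;             // stand-in for a schema lookup
    private final String defaultField;                  // may be null (no default given)

    public MetadataFieldResolver(Map<String, String> explicitMappings,
                                 Set<String> schemaFields,
                                 String defaultField) {
        this.explicitMappings = explicitMappings;
        this.schemaFields = schemaFields;
        this.defaultField = defaultField;
    }

    /**
     * Resolves a metadata name to a Solr field name.
     * Order: explicit mapping, then a same-named schema field,
     * then the default field (null means "skip this value").
     */
    public String resolve(String metadataName) {
        String mapped = explicitMappings.get(metadataName);
        if (mapped != null) {
            return mapped;
        }
        if (schemaFields.contains(metadataName)) {
            return metadataName;
        }
        return defaultField;
    }

    public static void main(String[] args) {
        MetadataFieldResolver resolver = new MetadataFieldResolver(
                Map.of("author", "doc_author"), // illustrative mapping
                Set.of("title", "doc_author"),  // illustrative schema fields
                "other_meta");                  // illustrative default field
        for (String name : new String[] {"author", "title", "pages"}) {
            System.out.println(name + " -> " + resolver.resolve(name));
        }
    }
}
```

With these made-up inputs, "author" resolves through the explicit mapping, 
"title" through the same-named schema field, and "pages" falls through to the 
default field.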

I should have a start of a patch today or tomorrow.

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
