[jira] Commented: (SOLR-284) Parsing Rich Document Types

Grant Ingersoll (JIRA) Fri, 14 Nov 2008 06:43:11 -0800

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647618#action_12647618 ]


Grant Ingersoll commented on SOLR-284:
--------------------------------------

Question for the people watching this:

Would you prefer that I create a new wiki page and keep the old one for those 
using Chris/Eric's patch, or would you rather I overwrite/edit the current one?

FWIW, some of the parameters will be the same, but I'm also adding quite a bit 
more: boosting; XPath expression support (Tika returns everything as XHTML, so 
it becomes possible to restrict extraction to the parts you care about); 
extraction only (i.e. no indexing); metadata extraction and indexing; support 
for sending in "literals", which are like the current fieldnames parameter; 
and likely some other pieces.
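
To illustrate the XPath idea, here is a minimal sketch of restricting XHTML 
to the parts you care about. The sample XHTML, the select_text helper, and the 
"content"/"header" class names are assumptions made up for illustration; they 
are not Tika's actual output or the handler's parameters. Python's ElementTree 
supports only a subset of XPath, which is enough for this case.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML standing in for what an extractor might emit.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Quarterly Report</title></head>
  <body>
    <div class="header"><p>Confidential</p></div>
    <div class="content"><p>Revenue grew.</p><p>Costs fell.</p></div>
  </body>
</html>"""

# XHTML elements live in this namespace, so paths need a prefix mapping.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def select_text(xhtml, path):
    """Return the stripped text of every element matched by the path."""
    root = ET.fromstring(xhtml)
    return ["".join(el.itertext()).strip() for el in root.findall(path, NS)]


# Keep only the paragraphs inside the "content" div; the header is skipped.
print(select_text(SAMPLE_XHTML, ".//x:div[@class='content']/x:p"))
```

Changing the path selects a different part of the document without touching 
the extraction step, which is the point of exposing the expression as a 
parameter.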

FYI: Out of the box, Tika supports the formats listed at 
http://incubator.apache.org/tika/formats.html and I know they are adding more 
as well, such as Flash.

It should also be noted that if you are just indexing metadata about a file, 
it makes more sense to do the work on the client side.
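
As a rough sketch of the client-side approach, the snippet below gathers 
file-system metadata and builds a standard Solr XML <add> message from it, 
without sending the file contents anywhere. The field names (id, file_name, 
file_size, last_modified) and the metadata_add_xml helper are assumptions for 
illustration; they would have to match your own schema, and posting the 
message to Solr is left out.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def metadata_add_xml(path):
    """Build a Solr <add> message from file-system metadata only."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    # Hypothetical field names; adjust to the fields your schema defines.
    fields = {
        "id": os.path.abspath(path),
        "file_name": os.path.basename(path),
        "file_size": str(info.st_size),
        "last_modified": modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


# Usage: describe a throwaway file, then clean it up.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"dummy")
try:
    print(metadata_add_xml(tmp.name))
finally:
    os.remove(tmp.name)
```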

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
