[jira] Commented: (SOLR-284) Parsing Rich Document Types

Chris Harris (JIRA) Thu, 20 Nov 2008 16:13:11 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649551#action_12649551
 ]


Chris Harris commented on SOLR-284:
-----------------------------------

A few comments on the ExtractingDocumentLoader:

* I think I like where this is going.

* Currently the default is ext.ignore.und.fl (IGNORE_UNDECLARED_FIELDS) == 
false, which means that if Tika returns a metadata field and you haven't made 
an explicit mapping from the Tika fieldname to your Solr fieldname, then Solr 
will throw an exception and your document add will fail. This doesn't sound 
very robust for a production environment, unless Tika will only ever use 
a finite list of metadata field names. (That doesn't sound plausible, though I 
admit I haven't looked into it.) Even in that case, I think I'd rather not have 
to set up a mapping for every possible field name in order to get started with 
this handler. Would true perhaps be a better default?

* ext.capture / CAPTURE_FIELDS: Do you have a use case in mind for this 
feature, Grant? The example in the patch is of routing text from <div> tags to 
one Solr field while routing text from other tags to a different Solr field. 
I'm kind of curious when this would be useful, especially keeping in mind that, 
in general, Tika source documents are not HTML, and so when <div> tags are 
generated they're as much artifacts of Tika as reflecting anything in the 
underlying document. (You could maybe ask a similar question about ext.inx.attr 
/ INDEX_ATTRIBUTES.)
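
To make the ext.ignore.und.fl trade-off above concrete, here is a minimal sketch of the two behaviours as described in this comment. It is not the code from the patch: map_metadata, field_map, and the sample field names are all hypothetical, and the real handler may decide what counts as "undeclared" differently.

```python
def map_metadata(metadata, field_map, ignore_undeclared=False):
    """Map extracted metadata names to index field names.

    Hypothetical illustration only; not the ExtractingDocumentLoader code.
    """
    doc = {}
    for name, value in metadata.items():
        if name in field_map:
            # An explicit mapping exists, so copy the value across.
            doc[field_map[name]] = value
        elif ignore_undeclared:
            # Lenient mode: silently drop metadata nobody asked for.
            continue
        else:
            # Strict mode: one unexpected name fails the whole add.
            raise KeyError(f"undeclared field: {name}")
    return doc


# Made-up sample input: the extractor returns one name we did not map.
metadata = {
    "title": "Annual Report",
    "Content-Type": "application/pdf",
    "producer": "SomeTool",
}
field_map = {"title": "doc_title", "Content-Type": "mime_type"}
```

With ignore_undeclared=True the unmapped "producer" entry is simply skipped; with the strict default the same input raises, which is the failure mode the bullet above is worried about.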

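For the ext.capture question, the routing idea can be sketched roughly as follows, assuming the extractor's output is available as an XHTML string. This uses Python's standard html.parser rather than Tika's SAX events, and CaptureRouter, route_text, and the default field name "text" are hypothetical names chosen for illustration, not anything taken from the patch.

```python
from html.parser import HTMLParser


class CaptureRouter(HTMLParser):
    """Route text inside captured elements to per-element fields.

    Hypothetical illustration only; fields are keyed by element name.
    """

    def __init__(self, capture, default_field="text"):
        super().__init__()
        self.capture = set(capture)
        self.default_field = default_field
        self.fields = {}
        self._stack = []  # open captured elements, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in self.capture:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text inside a captured element goes to that element's field;
        # everything else goes to the default field.
        field = self._stack[-1] if self._stack else self.default_field
        self.fields.setdefault(field, []).append(text)


def route_text(xhtml, capture):
    router = CaptureRouter(capture)
    router.feed(xhtml)
    router.close()
    return router.fields
```

Calling route_text(xhtml, ["div"]) on extractor output would collect all <div> text under "div" and the rest under "text". Whether that split is meaningful depends on whether the <div> boundaries reflect the source document or are just artifacts of the extraction, which is the question raised above.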

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, 
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
