[jira] Commented: (SOLR-284) Parsing Rich Document Types

Grant Ingersoll (JIRA) Mon, 29 Jun 2009 14:19:12 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725355#action_12725355
 ]


Grant Ingersoll commented on SOLR-284:
--------------------------------------

bq. ext.ignore.und.fl

I think this should be kept, and this is a case where we should silently 
ignore undeclared fields.  Parsing rich data is a different beast than normal 
Solr XML or other structured content.  Often you only want a few specific 
fields out of a large number, and it is burdensome to have to add ignores for 
all the metadata, not to mention that different document types may have 
different metadata.  So, -1 on removing.

bq. ext.idx.attr

Yes, we may want it to be false.  That's why I put it in!  :-)  It can be used 
to extract things like HREF into other fields or not.  Think faceting.
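
A rough sketch of what indexing attributes could look like, using only the 
Python standard library.  `AttributeCollector` and `collect_attribute_fields` 
are hypothetical names, and the XHTML snippet is made up; the actual handler 
may do this quite differently.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect attribute values into fields keyed by attribute name.

    Hypothetical illustration of the ext.idx.attr idea, not the
    handler's implementation.
    """

    def __init__(self, index_attributes):
        super().__init__()
        self.index_attributes = index_attributes
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if not self.index_attributes:
            return  # attribute indexing switched off
        for name, value in attrs:
            if value is not None:
                # HTMLParser lowercases attribute names, so HREF becomes href.
                self.fields.setdefault(name, []).append(value)


def collect_attribute_fields(xhtml, index_attributes=True):
    parser = AttributeCollector(index_attributes)
    parser.feed(xhtml)
    parser.close()
    return parser.fields


sample = '<p>See <a HREF="http://example.com/a">this page</a>.</p>'
print(collect_attribute_fields(sample))
```

Each link target ends up as a value of a multi-valued `href` field, which is 
the kind of thing you could facet on.  Passing `index_attributes=False` yields 
no attribute fields at all.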

bq. ext.metadata.prefix

This is not a mapping thing so much as a way to handle metadata fields 
separately from the main text fields.  I'm not sure it differs from the 
uprefix approach you are proposing, except that you know exactly what is 
metadata and what isn't.
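
As a sketch of the idea (again in Python, with hypothetical names such as 
`build_document` and a made-up `meta_` prefix), a metadata prefix keeps the 
two kinds of fields apart by name:

```python
def build_document(body_fields, metadata, metadata_prefix=""):
    """Merge body fields and metadata, prefixing only the metadata names.

    Hypothetical illustration of the ext.metadata.prefix idea; the
    function and its parameters are assumptions, not the handler's API.
    """
    doc = dict(body_fields)  # body fields keep their names
    for name, value in metadata.items():
        doc[metadata_prefix + name] = value
    return doc


doc = build_document(
    {"text": "Full text of the document"},
    {"author": "jdoe", "content_type": "application/pdf"},
    metadata_prefix="meta_",
)
print(doc)
```

Anything starting with `meta_` is known to be metadata, so a single dynamic 
field rule could pick it all up or ignore it all.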


Other questions that Yonik brought up:

1. I don't think trying to auto-map is a good idea.  New file formats will 
come with their own conventions, so it's better to have the user handle the 
mapping.
2. Fine with dropping ext for common names.
3. Metadata is often not useful, and I don't think we need to do the work 
suggested.  See Eric's comment above.
4. Enabling by default is fine.
 

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, 
> test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
