[jira] Commented: (SOLR-284) Parsing Rich Document Types

Yonik Seeley (JIRA) Sat, 27 Jun 2009 09:59:10 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724871#action_12724871
 ]


Yonik Seeley commented on SOLR-284:
-----------------------------------

Apologies for not reviewing this sooner after it was committed - but this is 
the last/best chance to improve the interface before 1.4 is released (and this 
is very important new functionality).

Since the "ext." prefix seems unnecessary, and removing it is already a name 
change, we might as well revisit the names themselves anyway.  Here are my 
first thoughts on it:
{code}
//////// generic type stuff that could be reused by other update handlers
boost.myfield=2.3
literal.myfield=Hello
map.origfield=newfield
uprefix=attr_ 
  // map any unknown fields using a standard prefix... good for
  // dynamic field mapping.

//////// more solr cell specific
capture.target_field=div
  // does capture + field-map in single step... avoids name clashes
xpath=xpath_expr
  // future: could do xpath.targetfield=xpath_expr
extract_only=true  // periods aren't word separators, but scoping operators
 // in the future, this could be replaced with a generic update operation
 // to return the document(s) instead of indexing them.
resource.name=test.pdf

New idea:
  nicenames=true // Last-Modified -> last_modified


REMOVED:
ext.ignore.und.fl 
  // throwing an exception when a field-type doesn't exist is generic
  // and not needed.  we should never silently ignore.
ext.idx.attr
  // do we ever want this to be false?  we can ignore all attributes
  // with field mappings if we want to
ext.metadata.prefix
  // seems like we only want to map unknown fields, not all fields
ext.def.fl 
  // we can use a standard field name for indexing main content
  // and use map to move it if desired. "content"? 
{code}
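
To make the intended interplay of the generic parameters concrete, here is a rough sketch in Python of how literal.*, map.*, uprefix and nicenames might be applied to extracted metadata. This is not Solr code: the function name apply_field_params, the known_fields set, and the order of the steps (nicenames first, then map, then uprefix) are assumptions made purely for illustration.

```python
import re


def apply_field_params(extracted, params, known_fields):
    """Hypothetical illustration of the proposed parameters; not Solr's implementation.

    extracted:    dict of field name -> value produced by extraction
    params:       dict of request parameters (all values are strings)
    known_fields: set of field names the schema accepts (assumed input)
    """
    nicenames = params.get("nicenames") == "true"
    uprefix = params.get("uprefix")
    doc = {}

    for name, value in extracted.items():
        if nicenames:
            # nicenames=true: Last-Modified -> last_modified
            name = re.sub(r"[^A-Za-z0-9]+", "_", name).lower()

        # map.origfield=newfield: explicit rename (assumed to run after nicenames)
        name = params.get("map." + name, name)

        if name not in known_fields:
            if uprefix is None:
                # never silently ignore an unknown field
                raise ValueError("unknown field: " + name)
            # uprefix=attr_: only unknown fields get the prefix
            name = uprefix + name

        doc[name] = value

    # literal.myfield=Hello: add fixed values supplied with the request
    for key, value in params.items():
        if key.startswith("literal."):
            doc[key[len("literal."):]] = value

    return doc
```

With nicenames=true, uprefix=attr_, map.author=creator and literal.id=doc1, and a schema that knows title, creator and id, an extracted Last-Modified value would end up in attr_last_modified, Author in creator, and title would pass through unchanged.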

Do people view this as an improvement?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, 
> test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
