[
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yonik Seeley updated SOLR-284:
------------------------------
Attachment: SOLR-284.patch
OK, here's my first crack at cleaning things up a little before release.
Changes:
- there were no tests for XML attribute indexing
- capture had no unit tests
- boost had no unit tests
- ignoring unknown fields had no unit test
- the metadata prefix had no unit test
- logging ignored fields at the INFO level for each document loaded was too
verbose
- removed the handling of undeclared fields; downstream components now
handle this
- avoided the String concatenation code for single-valued fields when Tika only
produces a single value (for performance)
- removed multiple-literal detection for single-valued fields; a
downstream component can handle it
- mapped literal values just as one would with generated metadata, since the user
may simply be supplying extra metadata; transforms (currently date
formatting) are applied as well
- fixed a bug where null field values were being added (and later dropped by
Solr... hence it was never caught)
- stopped catching previously thrown SolrExceptions... let them fly through
- removed some unused code (id generation, etc.)
- added a lowernames option to map field names to lowercase/underscores
- switched builderStack from a synchronized Stack to a LinkedList
- fixed a bug that caused content to be appended with no whitespace in between
- made the extracting request handler lazy-loading in the example config
- added ignored_ and attr_ dynamic fields to the example schema
Interface:
{code}
The default field is always "content" - use map to change it to something else
lowernames=true/false            // if true, map names like Content-Type to content_type
map.<fname>=<target_field>       // map a source field name to a different target field
boost.<fname>=<boost>            // index-time boost to apply to a field
literal.<fname>=<literal_value>  // literal value to add to a field
xpath=<xpath_expr>               // only generate content for the matching xpath expr
extractOnly=true/false           // if true, just return the extracted content
capture=<xml_element_name>       // separate out these elements
captureAttr=<xml_element_name>   // separate out the attributes for these elements
uprefix=<prefix>                 // unknown field prefix - prepended to any unknown field name
stream.type                      // explicit content type of the input stream
resource.name                    // name of the input resource (a hint for type detection)
{code}
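To illustrate how these parameters combine, here is a minimal sketch of a request URL. The handler path (/update/extract), the field names, and the document id are illustrative assumptions, not taken from the patch:

```shell
# Build a hypothetical request URL for the extracting handler.
# All parameter values here (doc1, attr_, div) are made-up examples.
params="literal.id=doc1&lowernames=true&uprefix=attr_&capture=div"
url="http://localhost:8983/solr/update/extract?${params}"
echo "$url"
# A real upload would stream a file to that URL, e.g.:
#   curl "$url" -F "file=@solr-word.pdf"
```

This would add the literal id "doc1", lowercase/underscore all incoming field names, prefix unknown fields with "attr_", and capture div elements into their own field.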
To make things more uniform, all fields - whether "content", metadata,
attributes, or literals - go through the same process:
1) map to lowercase if lowernames=true
2) apply map.field rules
3) if the resulting field is unknown, prefix it with uprefix
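The three steps above can be sketched roughly as follows. This is a minimal illustration, assuming lowernames=true, a single hypothetical map rule, uprefix=attr_, and a made-up set of known schema fields; none of these names come from the patch itself:

```shell
# resolve: apply the three field-name steps to one incoming name.
resolve() {
  # step 1: lowernames - lowercase and replace non-alphanumerics with '_'
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -c '[:alnum:]' '_')
  # step 2: apply map.<fname> rules (one illustrative rule here)
  case "$name" in last_modified) name=last_modified_dt ;; esac
  # step 3: prefix unknown fields with uprefix (known set is made up)
  case "$name" in
    content|content_type|last_modified_dt) ;;
    *) name="attr_$name" ;;
  esac
  printf '%s\n' "$name"
}

resolve "Content-Type"   # content_type
resolve "X-Parsed-By"    # attr_x_parsed_by
```

So a known metadata name maps cleanly onto a schema field, while anything unrecognized lands in a dynamic attr_* field instead of failing the update.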
Hopefully people will agree that this is an improvement in general. I think in
the future we'll need more advanced options, especially around dealing with
links in HTML and more powerful XPath constructs, but that's for after 1.4 IMO.
> Parsing Rich Document Types
> ---------------------------
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Eric Pugh
> Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch,
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch,
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch,
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch,
> solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip,
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, PowerPoint, or Excel document into
> Solr.
> There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.