[
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yonik Seeley updated SOLR-284:
------------------------------
Attachment: SOLR-284.patch
OK, here's my first crack at cleaning things up a little before release.
Changes:
- there were no tests for XML attribute indexing
- capture had no unit tests
- boost had no unit tests
- ignoring unknown fields had no unit test
- the metadata prefix had no unit test
- logging ignored fields at the INFO level for each document loaded was too
verbose
- removed the handling of undeclared fields; downstream components now
handle this
- avoided the String concatenation code for single-valued fields when Tika only
produces a single value (for performance)
- removed multiple-literal detection for single-valued fields; a
downstream component can handle it
- mapped literal values just as one would with generated metadata, since the user
may simply be supplying extra metadata; transforms (currently date
formatting) are applied as well
- fixed a bug where null field values were being added (and later dropped by
Solr... hence it was never caught)
- stopped catching previously thrown SolrExceptions... let them fly through
- removed some unused code (id generation, etc.)
- added a lowernames option to map field names to lowercase/underscores
- switched builderStack from a synchronized Stack to a LinkedList
- fixed a bug that caused content to be appended with no whitespace in between
- made the extracting request handler lazy-loading in the example config
- added ignored_ and attr_ dynamic fields to the example schema
Interface:
{code}
The default field is always "content" - use map to change it to something else
lowernames=true/false            // if true, map names like Content-Type to content_type
map.<fname>=<target_field>       // map a source field name to a different target field
boost.<fname>=<boost>            // index-time boost to apply to a field
literal.<fname>=<literal_value>  // literal value to add to a field
xpath=<xpath_expr>               // only generate content for the matching xpath expr
extractOnly=true/false           // if true, just return the extracted content
capture=<xml_element_name>       // separate out these elements
captureAttr=<xml_element_name>   // separate out the attributes for these elements
uprefix=<prefix>                 // unknown field prefix - prepended to any unknown field name
stream.type                      // explicit content type of the input stream
resource.name                    // name of the input resource (a hint for type detection)
{code}
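To illustrate how these parameters combine, here is a minimal sketch of a request URL. The handler path (/update/extract), the field names, and the document id are illustrative assumptions, not taken from the patch:

```shell
# Build a hypothetical request URL for the extracting handler.
# All parameter values here (doc1, attr_, div) are made-up examples.
params="literal.id=doc1&lowernames=true&uprefix=attr_&capture=div"
url="http://localhost:8983/solr/update/extract?${params}"
echo "$url"
# A real upload would stream a file to that URL, e.g.:
#   curl "$url" -F "file=@solr-word.pdf"
```

This would add the literal id "doc1", lowercase/underscore all incoming field names, prefix unknown fields with "attr_", and capture div elements into their own field.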
To make things more uniform, all fields - whether "content", metadata,
attributes, or literals - go through the same process:
1) map to lowercase if lowernames=true
2) apply map.field rules
3) if the resulting field is unknown, prefix it with uprefix
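The three steps above can be sketched roughly as follows. This is a minimal illustration, assuming lowernames=true, a single hypothetical map rule, uprefix=attr_, and a made-up set of known schema fields; none of these names come from the patch itself:

```shell
# resolve: apply the three field-name steps to one incoming name.
resolve() {
  # step 1: lowernames - lowercase and replace non-alphanumerics with '_'
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -c '[:alnum:]' '_')
  # step 2: apply map.<fname> rules (one illustrative rule here)
  case "$name" in last_modified) name=last_modified_dt ;; esac
  # step 3: prefix unknown fields with uprefix (known set is made up)
  case "$name" in
    content|content_type|last_modified_dt) ;;
    *) name="attr_$name" ;;
  esac
  printf '%s\n' "$name"
}

resolve "Content-Type"   # content_type
resolve "X-Parsed-By"    # attr_x_parsed_by
```

So a known metadata name maps cleanly onto a schema field, while anything unrecognized lands in a dynamic attr_* field instead of failing the update.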
Hopefully people will agree that this is an improvement in general. I think in
the future we'll need more advanced options, especially around dealing with
links in HTML and more powerful XPath constructs, but that's for after 1.4 IMO.
> Parsing Rich Document Types
> ---------------------------
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Eric Pugh
> Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch,
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch,
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch,
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch,
> solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip,
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, PowerPoint, or Excel document into
> Solr.
> There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.