As someone with very little knowledge of Solr Cell and/or Tika, I find
myself wondering if ExtractingRequestHandler would make more sense as an
extractingUpdateProcessor -- where it could be configured to take
either binary fields (or string fields containing URLs) out of the
Documents, parse them with Tika, and add the various XPath-matching hunks
of text back into the document as new fields.
Then ExtractingRequestHandler just becomes a handler that slurps up its
ContentStreams, adds them as binary data fields, and adds the other
literal params as fields.
Wouldn't that make things like SOLR-1358, and using Tika with
URLs/filepaths in XML and CSV based updates fairly trivial?
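
To make the proposed split concrete, here is a rough sketch in plain Java. It
deliberately uses no Solr or Tika classes: the document is just a Map, the
"extractor" is a hypothetical stand-in for a Tika parse, and every name
(ExtractingProcessorSketch, process, handlerBuildDoc, the "raw" field) is
made up for illustration. Dropping the binary field after parsing is also an
assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ExtractingProcessorSketch {

    // Hypothetical update-processor step: take one binary field out of the
    // document, hand the bytes to an extractor (standing in for Tika), and
    // add whatever fields it returns back into the document.
    public static Map<String, Object> process(
            Map<String, Object> doc,
            String sourceField,
            Function<byte[], Map<String, String>> extractor) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        Object raw = out.remove(sourceField); // assumption: raw bytes are not kept
        if (raw instanceof byte[]) {
            out.putAll(extractor.apply((byte[]) raw));
        }
        return out;
    }

    // What the request handler would shrink to under this design: stash the
    // stream as a binary field and copy the literal params in as fields.
    public static Map<String, Object> handlerBuildDoc(
            byte[] stream, Map<String, String> literals, String binaryField) {
        Map<String, Object> doc = new LinkedHashMap<>(literals);
        doc.put(binaryField, stream);
        return doc;
    }

    public static void main(String[] args) {
        // Fake extractor: treats the bytes as UTF-8 text and emits one field.
        Function<byte[], Map<String, String>> fakeExtractor = bytes -> {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("content", new String(bytes, StandardCharsets.UTF_8));
            return fields;
        };

        Map<String, String> literals = new LinkedHashMap<>();
        literals.put("id", "doc1");

        Map<String, Object> doc = handlerBuildDoc(
                "hello".getBytes(StandardCharsets.UTF_8), literals, "raw");
        System.out.println(process(doc, "raw", fakeExtractor));
    }
}
```

The point of the sketch is that process() does not care where the document
came from, so an XML or CSV update that already carries a binary (or URL)
field could go through the same step as one built by the handler.
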
-Hoss
- Solr Cell revamped as ... Chris Hostetter