Integrating extraction with DIH is a better option. DIH makes it easier
to map fields, etc.


On Tue, Dec 8, 2009 at 4:59 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> On Dec 7, 2009, at 3:51 PM, Chris Hostetter wrote:
>
>>
>> As someone with very little knowledge of Solr Cell and/or Tika, I find 
>> myself wondering if ExtractingRequestHandler would make more sense as an 
>> extractingUpdateProcessor -- where it could be configured to take 
>> either binary fields (or string fields containing URLs) out of the 
>> Documents, parse them with tika, and add the various XPath matching hunks of 
>> text back into the document as new fields.
>>
>> Then ExtractingRequestHandler just becomes a handler that slurps up its 
>> ContentStreams and adds them as binary data fields and adds the other 
>> literal params as fields.
>>
>> Wouldn't that make things like SOLR-1358, and using Tika with URLs/filepaths 
>> in XML and CSV based updates fairly trivial?
>
> It probably could, but I'm not sure how it would work in a processor chain.  
> Then again, I'm not sure I understand how those work all that well either.  I 
> also plan on adding, BTW, a SolrJ client for Tika that does the extraction on 
> the client.  In many cases, the ExtrReqHandler is really only designed for 
> lighter weight extraction cases, as one would simply not want to send that 
> much rich content over the wire.



-- 
-----------------------------------------------------
Noble Paul | Systems Architect| AOL | http://aol.com
