I created an issue for this improvement idea to make sure it doesn't just die 
away:
https://issues.apache.org/jira/browse/SOLR-1763

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 22 Jan 2010, at 23:37, Jan Høydahl / Cominvent wrote:

> On 8 Dec 2009, at 00:29, Grant Ingersoll wrote:
>> On Dec 7, 2009, at 3:51 PM, Chris Hostetter wrote:
>>> As someone with very little knowledge of Solr Cell and/or Tika, I find 
>>> myself wondering if ExtractingRequestHandler would make more sense as an 
>>> extractingUpdateProcessor -- where it could be configured to take 
>>> either binary fields (or string fields containing URLs) out of the 
>>> documents, parse them with Tika, and add the various XPath matching hunks 
>>> of text back into the document as new fields.
>>> 
>>> Then ExtractingRequestHandler just becomes a handler that slurps up its 
>>> ContentStreams, adds them as binary data fields, and adds the other 
>>> literal params as fields.
>>> 
>>> Wouldn't that make things like SOLR-1358, and using Tika with 
>>> URLs/filepaths in XML and CSV based updates fairly trivial?
>> 
>> It probably could, but I'm not sure how it would work in a processor 
>> chain.  However, I'm not sure I understand how those work all that much 
>> either.  I also plan on adding, BTW, a SolrJ client for Tika that does the 
>> extraction on the client.  The ExtrReqHandler is really only designed for 
>> lighter weight extraction cases, as one would simply not want to send that 
>> much rich content over the wire.
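
To make the client-side option concrete, here is a minimal, untested sketch 
of what such extract-then-index code could look like with plain Tika plus 
SolrJ. This is just an illustration, not the client Grant describes: the 
field names and Solr URL are made up, and it assumes the Solr 1.4-era 
CommonsHttpSolrServer.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ClientSideExtractor {
  public static void main(String[] args) throws Exception {
    File file = new File(args[0]);

    // Run Tika locally so only extracted text goes over the wire
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no limit
    Metadata metadata = new Metadata();
    InputStream in = new FileInputStream(file);
    try {
      new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
    } finally {
      in.close();
    }

    // Index plain fields through SolrJ as usual
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", file.getName());
    doc.addField("text", handler.toString());

    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    solr.add(doc);
    solr.commit();
  }
}

The point being that only the extracted text crosses the wire, never the raw 
PDF/Office bytes.
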
> 
> Good match. UpdateProcessors are the way to go for functionality that 
> modifies documents prior to indexing.
> With this, we can mix and match any type of content source with any other 
> processing needs.
> 
> I think it can be beneficial to have the choice to do extraction on the 
> SolrJ side. But you don't always have that choice: if your source is a 
> crawler without built-in Tika, a base64 encoded field in an XML feed or 
> some other random source, you want to do the extraction at an arbitrary 
> place in the chain.
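
To make this concrete, here is a rough sketch of what such a 
TikaUpdateProcessor could look like. Field names are made up, error handling 
is minimal, and a real version would also need a matching 
UpdateRequestProcessorFactory plus solrconfig.xml wiring.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaUpdateProcessor extends UpdateRequestProcessor {

  public TikaUpdateProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object raw = doc.getFieldValue("binarybody"); // made-up source field
    if (raw instanceof byte[]) {
      try {
        BodyContentHandler text = new BodyContentHandler(-1);
        Metadata meta = new Metadata();
        new AutoDetectParser().parse(
            new ByteArrayInputStream((byte[]) raw), text, meta,
            new ParseContext());
        doc.addField("text", text.toString());
        if (meta.get(Metadata.TITLE) != null) {
          doc.addField("title", meta.get(Metadata.TITLE));
        }
        doc.removeField("binarybody"); // don't index the raw bytes
      } catch (Exception e) {
        throw new IOException("Tika extraction failed: " + e.getMessage());
      }
    }
    super.processAdd(cmd); // pass the enriched doc down the chain
  }
}
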
> 
> Examples:
>  Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text, 
> +meta...) -> index
>  XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor 
> (+text, +meta) -> index
>  DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index
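
And to show that such processors need not involve Tika at all, here is a 
sketch of the LatLon2GeoHashProcessor from the last example. Again the field 
names are made up, and the geohash encoder is hand-rolled purely for 
illustration.

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class LatLon2GeoHashProcessor extends UpdateRequestProcessor {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  public LatLon2GeoHashProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object lat = doc.getFieldValue("lat"); // made-up field names
    Object lon = doc.getFieldValue("lon");
    if (lat != null && lon != null) {
      doc.addField("geohash", encode(Double.parseDouble(lat.toString()),
                                     Double.parseDouble(lon.toString()), 12));
    }
    super.processAdd(cmd);
  }

  // Standard geohash: interleave lon/lat bisection bits, 5 bits per char
  static String encode(double lat, double lon, int precision) {
    double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
    StringBuilder hash = new StringBuilder();
    boolean evenBit = true; // even bits encode longitude
    int bit = 0, ch = 0;
    while (hash.length() < precision) {
      double mid;
      if (evenBit) {
        mid = (lonMin + lonMax) / 2;
        if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
        else            { ch = ch << 1;       lonMax = mid; }
      } else {
        mid = (latMin + latMax) / 2;
        if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
        else            { ch = ch << 1;       latMax = mid; }
      }
      evenBit = !evenBit;
      if (++bit == 5) {
        hash.append(BASE32.charAt(ch));
        bit = 0;
        ch = 0;
      }
    }
    return hash.toString();
  }
}
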
> 
> I propose to model the document processor chain more after FAST ESP's 
> flexible processing chain, which must be seen as an industry best practice. 
> I'm thinking of starting a Wiki page to map out what direction we should 
> go in.
> 
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com
> 
