I created an issue for this improvement idea to make sure it doesn't just fade away: https://issues.apache.org/jira/browse/SOLR-1763
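To make the idea concrete, here is a rough, untested sketch of what such a Tika update processor could look like, built on Solr's UpdateRequestProcessor API and Tika's AutoDetectParser. The class name and the field names ("binary_content", "text", "title") are made up for illustration; nothing like this ships with Solr today:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaUpdateProcessor extends UpdateRequestProcessor {

      public TikaUpdateProcessor(UpdateRequestProcessor next) {
        super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // Illustrative field name: raw bytes that some earlier stage
        // (a crawler, a base64 field, a URL-fetching processor, ...)
        // put on the document.
        Object raw = doc.getFieldValue("binary_content");
        if (raw instanceof byte[]) {
          try {
            BodyContentHandler text = new BodyContentHandler(-1); // no write limit
            Metadata meta = new Metadata();
            new AutoDetectParser().parse(
                new ByteArrayInputStream((byte[]) raw), text, meta);
            doc.addField("text", text.toString());
            if (meta.get("title") != null) {
              doc.addField("title", meta.get("title"));
            }
            doc.removeField("binary_content"); // don't index the raw bytes
          } catch (Exception e) {
            throw new IOException("Tika extraction failed: " + e.getMessage());
          }
        }
        super.processAdd(cmd); // pass the enriched doc down the chain
      }
    }

Because the processor only cares about finding bytes on the document, the same class would work no matter whether a crawler, a base64 field, or an upstream URL-fetching processor put them there, which is exactly the mix-and-match point in the thread quoted below.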
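Wiring it in would go through the usual factory, referenced from an updateRequestProcessorChain in solrconfig.xml ahead of solr.RunUpdateProcessorFactory, so the example chains quoted below become pure configuration. Again a hypothetical sketch, not an existing Solr class:

    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class TikaUpdateProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        // One processor instance per request, linked into the rest
        // of the configured chain.
        return new TikaUpdateProcessor(next);
      }
    }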
--
Jan Høydahl - search architect
Cominvent AS - www.cominvent.com

On 22. jan. 2010, at 23.37, Jan Høydahl / Cominvent wrote:

> On 8. des. 2009, at 00.29, Grant Ingersoll wrote:
>
>> On Dec 7, 2009, at 3:51 PM, Chris Hostetter wrote:
>>
>>> As someone with very little knowledge of Solr Cell and/or Tika, I find
>>> myself wondering if ExtractingRequestHandler would make more sense as an
>>> extractingUpdateProcessor -- where it could be configured to take either
>>> binary fields (or string fields containing URLs) out of the documents,
>>> parse them with Tika, and add the various XPath-matching hunks of text
>>> back into the document as new fields.
>>>
>>> Then ExtractingRequestHandler just becomes a handler that slurps up its
>>> ContentStreams, adds them as binary data fields, and adds the other
>>> literal params as fields.
>>>
>>> Wouldn't that make things like SOLR-1358, and using Tika with
>>> URLs/filepaths in XML- and CSV-based updates, fairly trivial?
>>
>> It probably could, but I am not sure how it would work in a processor
>> chain. However, I'm not sure I understand how processor chains work all
>> that well either. BTW, I also plan on adding a SolrJ client for Tika that
>> does the extraction on the client side. The ExtractingRequestHandler is
>> really only designed for lighter-weight extraction cases, as one would
>> simply not want to send that much rich content over the wire.
>
> Good match. UpdateProcessors are the way to go for functionality that
> modifies documents prior to indexing. With this, we can mix and match any
> type of content source with other processing needs.
>
> I think it can be beneficial to have the choice of doing extraction on the
> SolrJ side. But you don't always have that choice: if your source is a
> crawler without built-in Tika, a base64-encoded field in an XML file, or
> some other random source, you want to be able to do the extraction at an
> arbitrary place in the chain.
>
> Examples:
> Crawler (httpheaders, binarybody) -> TikaUpdateProcessor (+title, +text,
> +meta...) -> index
> XML (title, pdfurl) -> GetUrlProcessor (+pdfbin) -> TikaUpdateProcessor
> (+text, +meta) -> index
> DIH (city, street, lat, lon) -> LatLon2GeoHashProcessor (+geohash) -> index
>
> I propose to model the document processor chain more after FAST ESP's
> flexible processing chain, which must be seen as an industry best practice.
> I'm thinking of starting a wiki page to map out what direction we should
> take.
>
> --
> Jan Høydahl - search architect
> Cominvent AS - www.cominvent.com