I like the idea of providing a home for all these other non-core projects as well. I think my approach in SOLR-284 could be used as a starting point, or as a source of ideas, but it was aimed fairly specifically at scratching my own itch.

It does seem like parsing rich documents is a popular request!

Eric


On Sep 25, 2007, at 3:24 PM, Grant Ingersoll wrote:

I am working on a RequestHandler that incorporates Aperture (http://aperture.sourceforge.net) into Solr. Aperture is a crawling and extraction framework based on RDF. It provides a common interface to disparate open source libraries like PDFBox, POI, and OpenOffice, as well as to data stores like IMAP, and it also crawls HTTP, file systems, etc. It has similar goals to Tika (a Lucene TLP sub-project) but is much further along in my opinion (although I do notice that Tika has picked up the pace lately). Tika, or another extraction library, could easily be dropped in as a replacement at any point in the future. I also have a client-side version using SolrJ and Aperture. This would be related to https://issues.apache.org/jira/browse/SOLR-284, but I haven't yet looked for synergies between Eric's idea and mine. I will do that.

I know I could put this in the core as a ReqHandler just like all the others, but it doesn't really seem to fit there, especially since it has a fair number of dependencies (Aperture, PDFBox, POI, etc.).

I would like to suggest we start a contrib package for Solr modeled after the Lucene Java contrib package. One question that comes to mind: do we just want to mirror the processes of Lucene Java, or do we think there are improvements to be made? One thing that I dislike about the current Lucene Java way is the dependency management. Some of the contrib modules have the same copy of a library checked in, or they rely on non-ASF-compatible code. Either Maven or Ivy would easily solve this problem, with my preference being Maven (but I am not trying to start a Maven war here, so please don't take it that way).

Anyone have thoughts on this? I will submit a patch at some point in the near future.

-Grant



-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com


