I am working on a RequestHandler that incorporates Aperture
(http://aperture.sourceforge.net) into Solr. Aperture is a crawling and
extraction framework based on RDF that provides a common interface to
disparate open source libraries such as PDFBox, POI, and OpenOffice, as
well as to data stores such as IMAP. It also crawls HTTP, file systems,
etc. It has similar goals to Tika (a Lucene TLP sub-project) but is, in
my opinion, much further along (although I do notice that Tika has
picked up the pace lately). Tika, or another extraction library, could
easily be dropped in as a replacement at any point in the future. I
also have a client-side version using
SolrJ and Aperture. This would be related to
https://issues.apache.org/jira/browse/SOLR-284, but I haven't yet looked
for synergies between Eric's idea and mine. I will do that.

I know I could put this in the core as a RequestHandler just like all
the others, but it doesn't really seem to fit there, especially since it
has a fair number of dependencies (Aperture, PDFBox, POI, etc.).

I would like to suggest we start a contrib package for Solr modeled
after the Lucene Java contrib package. One question that comes to mind
is whether we just want to mirror the processes of Lucene Java, or
whether there are improvements to be made. One thing that I dislike
about the current Lucene Java way is the dependency management. Several
of the contrib modules have copies of the same libraries checked in, or
they rely on code that is not ASF-compatible. Either Maven or Ivy would
easily solve this problem; my preference is Maven (but I am not trying
to start a Maven war here, so please don't take it that way).

Anyone have thoughts on this? I will submit a patch at some point in
the near future.
-Grant
- Solr contrib Grant Ingersoll