I like the idea of providing a home for all these other non-core
projects as well. I think my approach in SOLR-284 could be used as
a starting point, or for ideas, but it was aimed fairly specifically
at scratching my itch.
It does seem like parsing rich documents is a popular request!
Eric
On Sep 25, 2007, at 3:24 PM, Grant Ingersoll wrote:
I am working on a RequestHandler that incorporates Aperture
(http://aperture.sourceforge.net) into Solr. Aperture is a
crawling and extraction framework based on RDF that provides a
common interface to disparate open source libraries like PDFBox,
POI, and OpenOffice, as well as data stores like IMAP. It also
crawls HTTP, file systems, etc. It has similar goals to Tika
(a Lucene TLP sub-project) but is much further along, in my opinion
(although I do notice that Tika has picked up the pace lately).
Tika could easily be dropped in as a replacement at any point in
the future (or other extraction libraries, too). I also have a
client-side version using SolrJ and Aperture. This would be
related to https://issues.apache.org/jira/browse/SOLR-284 but I
haven't looked for synergies between Eric's idea and mine. I will
do that.
I know I could put this in the core as a ReqHandler just like all
the others, but it doesn't really seem like it fits there,
especially due to having a fair number of dependencies (Aperture,
PDFBox, POI, etc.).
I would like to suggest we start a contrib package for Solr modeled
after the Lucene Java contrib package. One question that comes to
mind: do we just want to mirror the processes of Lucene Java, or
do we think there are improvements to be made? One thing that I
dislike about the current Lucene Java way is the dependency
management. Some of the contrib modules check in duplicate copies
of the same libraries, or they rely on code that is not
ASF-compatible. Either Maven or Ivy would easily solve this
problem, with my preference being
Maven (but I am not trying to start a Maven war here, either, so
please don't take it that way).
Anyone have thoughts on this? I will submit a patch at some point
in the near future.
-Grant
-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com