Hi All:

I’m currently finishing a custom plugin that filters documents at the 
indexing stage (implemented as an IndexingFilter for Nutch 1.x). Basically, it 
lets you configure which types of documents end up indexed in Solr. The need 
for this plugin (in our case) arose when we had to build an image search engine 
using Nutch: we want to crawl and parse all the default formats (looking for 
links to images) but index in Solr only those documents that are actually 
images. The plugin uses an auxiliary config file with a syntax similar to 
SuffixURLFilter's, so you can allow everything except a few formats, or deny 
everything except the formats you are interested in. 
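
To illustrate the allow/deny idea, here is a minimal sketch. It is not the 
plugin's actual code. The class name SuffixIndexRule and the config syntax (a 
lone "+" or "-" sets the default, every other line is a suffix that is an 
exception to it, "#" starts a comment) are assumptions made for the example, 
loosely modelled on SuffixURLFilter. In a real Nutch 1.x IndexingFilter, the 
filter() method would return null to drop a document when a check like 
shouldIndex() fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides from a URL suffix list
// whether a document should be sent to the index.
final class SuffixIndexRule {
    private final boolean allowByDefault; // true in "+" mode, false in "-" mode
    private final List<String> suffixes;  // suffixes that are exceptions to the default

    private SuffixIndexRule(boolean allowByDefault, List<String> suffixes) {
        this.allowByDefault = allowByDefault;
        this.suffixes = suffixes;
    }

    // Assumed syntax: blank lines and '#' comments are skipped, a lone "+" or "-"
    // sets the default, and any other line is a suffix excepted from that default.
    static SuffixIndexRule parse(List<String> lines) {
        boolean allowByDefault = true;
        List<String> suffixes = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            if (line.equals("+")) {
                allowByDefault = true;
            } else if (line.equals("-")) {
                allowByDefault = false;
            } else {
                suffixes.add(line.toLowerCase(Locale.ROOT));
            }
        }
        return new SuffixIndexRule(allowByDefault, suffixes);
    }

    // Returns true if a document with this URL should be indexed.
    boolean shouldIndex(String url) {
        String path = url.toLowerCase(Locale.ROOT);
        // Ignore the query string and fragment when matching the suffix.
        for (char marker : new char[] {'?', '#'}) {
            int cut = path.indexOf(marker);
            if (cut >= 0) {
                path = path.substring(0, cut);
            }
        }
        for (String suffix : suffixes) {
            if (path.endsWith(suffix)) {
                return !allowByDefault; // a listed suffix flips the default
            }
        }
        return allowByDefault;
    }
}
```

With the lines "-", ".jpg", ".png" the rule indexes only JPEG and PNG URLs; 
with "+", ".pdf" it indexes everything except PDFs. Matching on the URL suffix 
is only one option; matching on the detected content type would be another.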

The work is almost done and I’m writing a few tests to keep things consistent 
and organized. My question is whether anyone would be interested in such a 
plugin if I shared it. The plugin really is dead simple, and the same thing 
could easily be accomplished by anyone.

Regards,