Hi all, I'm currently finishing a custom plugin that filters documents at the indexing stage (implemented as an IndexingFilter for Nutch 1.x). Basically, it lets you configure which types of documents end up indexed in Solr. In our case the need for this plugin came up when we had to build an image search engine using Nutch: we want to crawl and parse all the default formats (looking for links to images) but only index in Solr those documents that are actually images. The plugin uses an auxiliary config file with a syntax similar to the one used by SuffixURLFilter, so you can allow everything except a few formats, or deny everything except the formats you are interested in.
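To give a rough idea of the allow/deny logic, here is a minimal standalone sketch. It is not the plugin's actual code: the class name `MimeTypeRules`, the method `shouldIndex`, the choice to match on MIME types, and the exact file format (a `+` or `-` line for the default, then one entry per line) are all assumptions made for illustration, only loosely modeled on the SuffixURLFilter style.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical rule set for illustration only; names and format are assumptions.
final class MimeTypeRules {
    // true: index everything except the listed types; false: index only the listed types
    private final boolean acceptByDefault;
    private final Set<String> listed = new HashSet<>();

    MimeTypeRules(List<String> configLines) {
        boolean accept = true; // assumed default when no "+" or "-" line is present
        for (String raw : configLines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            if (line.equals("+")) {
                accept = true;
            } else if (line.equals("-")) {
                accept = false;
            } else {
                // an entry is either a full type ("image/png") or a primary type ("image")
                listed.add(line.toLowerCase(Locale.ROOT));
            }
        }
        this.acceptByDefault = accept;
    }

    boolean shouldIndex(String contentType) {
        if (contentType == null) {
            return acceptByDefault; // unknown type: fall back to the default
        }
        // drop parameters such as "; charset=UTF-8" before matching
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        String primary = type.split("/", 2)[0];
        boolean isListed = listed.contains(type) || listed.contains(primary);
        // listed entries are the exceptions to the default
        return acceptByDefault != isListed;
    }

    public static void main(String[] args) {
        // deny everything except images
        MimeTypeRules imagesOnly =
            new MimeTypeRules(Arrays.asList("# index only images", "-", "image"));
        System.out.println(imagesOnly.shouldIndex("image/png"));                // true
        System.out.println(imagesOnly.shouldIndex("text/html; charset=UTF-8")); // false
    }
}
```

In a Nutch 1.x IndexingFilter, a check like this would typically sit inside `filter(...)`, returning `null` to discard the document when `shouldIndex` is false and returning the document unchanged otherwise.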
The work is almost done and I'm writing a few tests to keep things consistent and organized. My question is whether anyone would be interested in such a plugin, so that I can share it. The plugin really is dead simple, and anyone could accomplish the same without much effort.

Regards,

