Hi Jorge, yes, such a plugin could be of general interest, not only for the image search use case but also because URL filters cannot always filter out undesired MIME types (e.g., a .php URL may deliver PDF content).
It would be great if you could open an issue in Jira and provide a patch for the MIME type filter plugin.

Thanks,
Sebastian

On 07/17/2014 09:11 PM, Jorge Luis Betancourt Gonzalez wrote:
> Hi All:
>
> I’m currently finishing a custom plugin that allows filtering documents
> at the indexing stage (implemented as an IndexingFilter for Nutch 1.x).
> Basically, it lets you configure which types of documents you would like
> to end up indexed in Solr. The need for this plugin (in our case) arose
> when we had to build an image search engine using Nutch: we want to
> crawl/parse all the default formats (searching for links to images) but
> only index in Solr those documents that are actually images. The plugin
> uses an auxiliary config file with a syntax similar to SuffixURLFilter,
> so you can allow everything except a few formats, or deny everything
> except the formats of interest.
>
> The work is almost done and I’m writing a few tests to keep things
> consistent and organized. My question is whether anyone is interested in
> such a plugin, so that I can share it. The plugin is really dead simple
> and the same could very well be accomplished by anyone.
>
> Regards,
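For readers following the thread, the allow/deny rule described in the quoted message can be sketched in plain Java. This is only an illustration under stated assumptions, not the plugin's code: the class name `MimeTypeRules`, the method `accepts`, and the config syntax (a `+` or `-` line sets the default, every other non-comment line is an exception to it) are all hypothetical, loosely modelled on the description "similar to SuffixURLFilter". In a real Nutch 1.x IndexingFilter, the decision would presumably be applied to the document's detected content type, and the document dropped when the check fails.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical rule set for MIME type filtering; not the actual plugin class. */
final class MimeTypeRules {
    // true: accept everything except the listed types; false: reject everything except them
    private final boolean defaultAccept;
    private final Set<String> exceptions = new HashSet<>();

    /** Parses config lines: "+" or "-" sets the default, "#" starts a comment (assumed syntax). */
    MimeTypeRules(List<String> configLines) {
        boolean accept = true;
        for (String raw : configLines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            if (line.equals("+")) {
                accept = true;
            } else if (line.equals("-")) {
                accept = false;
            } else {
                exceptions.add(normalize(line));
            }
        }
        this.defaultAccept = accept;
    }

    /** Returns true if a document with this content type should be indexed. */
    boolean accepts(String contentType) {
        if (contentType == null) {
            return defaultAccept; // unknown type: fall back to the default
        }
        boolean listed = exceptions.contains(normalize(contentType));
        return listed != defaultAccept; // a listed type flips the default
    }

    /** Strips parameters such as "; charset=utf-8" and lower-cases the type. */
    private static String normalize(String type) {
        int semi = type.indexOf(';');
        String base = semi >= 0 ? type.substring(0, semi) : type;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Image search case: deny everything except two image types.
        MimeTypeRules rules = new MimeTypeRules(
                Arrays.asList("# index images only", "-", "image/png", "image/jpeg"));
        System.out.println("image/png -> " + rules.accepts("image/png"));
        System.out.println("text/html -> " + rules.accepts("text/html"));
    }
}
```

Because the check works on the content type rather than the URL, it would also cover the case mentioned in the reply, where a .php URL delivers PDF content.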

