Hi - we simply modify parse-plugins so that only the content types we want are parsed. The other documents are never indexed anyway, so we can skip parsing them altogether.
-----Original message-----
> From:Sebastian Nagel <[email protected]>
> Sent: Monday 21st July 2014 23:13
> To: [email protected]
> Subject: Re: Filtering indexing of documents by MIME Type
>
> Hi Jorge,
>
> yes, such a plugin could be of general interest.
> Not only for the use case of an image search,
> but also because URL filters are not always able
> to filter out undesired MIME types (e.g., a .php URL
> may deliver PDF content).
>
> It would be great if you could open an issue in Jira and provide
> a patch for the MIME type filter plugin.
>
> Thanks,
> Sebastian
>
> On 07/17/2014 09:11 PM, Jorge Luis Betancourt Gonzalez wrote:
> > Hi All:
> >
> > I’m currently finishing a custom plugin that allows filtering of
> > documents at the indexing stage (implemented as an IndexingFilter for
> > Nutch 1.x). Basically, it lets you configure which types of documents
> > you would like to end up indexed in Solr. The need for this plugin (in
> > our case) came when we had to build an image search engine using Nutch:
> > we want to crawl/parse all the default formats (searching for links to
> > images) but only index in Solr those documents that are actually images.
> > The plugin uses an auxiliary config file with a syntax similar to
> > SuffixURLFilter, so you can allow everything except a few formats, or
> > deny everything except the formats of your interest.
> >
> > The work is almost done and I’m writing a few tests to keep things
> > consistent and organized. My question is whether anyone is interested in
> > such a plugin, so that I can share it. The plugin is really dead simple,
> > and the same could very well be accomplished by anyone.
> >
> > Regards,
> >
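
For illustration, the allow/deny decision described in the quoted message could be sketched roughly as below. This is a minimal standalone sketch, not the actual plugin: the class name `MimeTypeRules`, the method `shouldIndex`, and the rule syntax (a lone `+` or `-` line sets the default, every other line is a MIME type that is an exception to it, loosely modeled on the SuffixURLFilter idea) are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides whether a document
// with a given MIME type should end up in the index.
final class MimeTypeRules {
    // true  = "+" mode: index everything except the listed types
    // false = "-" mode: index nothing except the listed types
    private final boolean acceptByDefault;
    private final Set<String> exceptions = new HashSet<>();

    // Assumed rule syntax: blank lines and lines starting with '#' are
    // ignored, a lone "+" or "-" sets the default mode, and any other
    // line is a MIME type that is an exception to that default.
    MimeTypeRules(List<String> lines) {
        boolean mode = true; // assumption: accept everything if no mode line is given
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            if (line.equals("+")) {
                mode = true;
            } else if (line.equals("-")) {
                mode = false;
            } else {
                exceptions.add(line.toLowerCase(Locale.ROOT));
            }
        }
        acceptByDefault = mode;
    }

    // Returns true if a document with this MIME type should be indexed.
    boolean shouldIndex(String mimeType) {
        if (mimeType == null) {
            return acceptByDefault; // unknown type: fall back to the default
        }
        String type = mimeType.toLowerCase(Locale.ROOT);
        int semi = type.indexOf(';'); // drop parameters such as "; charset=..."
        if (semi >= 0) {
            type = type.substring(0, semi);
        }
        boolean listed = exceptions.contains(type.trim());
        return acceptByDefault ? !listed : listed;
    }
}
```

With the rules `-`, `image/png`, `image/jpeg`, only those two image types would be indexed; with `+`, `application/pdf`, everything except PDF would be. In a Nutch 1.x IndexingFilter, the idea would be to return null from `filter` whenever `shouldIndex` is false, so that the document is dropped from indexing.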

