Hi - we just modify parse-plugins so that only the content types we want are 
parsed. Documents that are not parsed are never indexed anyway, and we get to 
skip the parsing as well.

 
 
-----Original message-----
> From:Sebastian Nagel <[email protected]>
> Sent: Monday 21st July 2014 23:13
> To: [email protected]
> Subject: Re: Filtering indexing of documents by MIME Type
> 
> Hi Jorge,
> 
> yes, such a plugin could be of general interest.
> Not only for the use case of an image search,
> but also because URL filters cannot always
> filter out undesired MIME types (e.g., a .php URL
> may deliver PDF content).
> 
> Would be great if you could open an issue in Jira and provide
> a patch for the MIME type filter plugin.
> 
> Thanks,
> Sebastian
> 
> On 07/17/2014 09:11 PM, Jorge Luis Betancourt Gonzalez wrote:
> > Hi All:
> > 
> > I’m currently finishing a custom plugin that filters documents at the 
> > indexing stage (implemented as an IndexingFilter for Nutch 1.x). Basically, 
> > it lets you configure which types of documents end up indexed in Solr. The 
> > need for this plugin (in our case) came up when we had to build an image 
> > search engine using Nutch: we want to crawl/parse all the default formats 
> > (searching for links to images) but only index in Solr those documents that 
> > are actually images. The plugin uses an auxiliary config file with a syntax 
> > similar to SuffixURLFilter, so you can allow everything except a few 
> > formats, or deny everything except the formats of your interest. 
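
The allow/deny behaviour described above can be sketched roughly as follows. This is not the plugin's code (the real one is a Java IndexingFilter for Nutch 1.x), only a minimal Python illustration of the decision logic. The rule-file layout is an assumption modelled loosely on the SuffixURLFilter idea: a line holding "+" or "-" sets the default (accept or reject), and every other non-comment line names a MIME type that is an exception to that default. The names parse_rules and should_index are hypothetical.

```python
def parse_rules(text):
    """Parse a rule file body into (accept_by_default, exceptions).

    Assumed, hypothetical format:
      "#..."  comment line
      "+"     accept everything except the listed MIME types
      "-"     reject everything except the listed MIME types
      other   a MIME type that is an exception to the default
    """
    accept_by_default = True  # assumption: accept when no "+"/"-" line is given
    exceptions = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line in ("+", "-"):
            accept_by_default = (line == "+")
        else:
            exceptions.add(line.lower())
    return accept_by_default, exceptions


def should_index(mime_type, rules):
    """Return True if a document with this MIME type should be indexed."""
    accept_by_default, exceptions = rules
    # Drop parameters such as "; charset=UTF-8" before comparing.
    base = mime_type.split(";", 1)[0].strip().lower()
    listed = base in exceptions
    # A listed type flips the default; an unlisted type follows it.
    return accept_by_default != listed


# Image-search case: deny everything except a few image formats.
IMAGES_ONLY = """\
# deny everything except the listed types
-
image/jpeg
image/png
"""
rules = parse_rules(IMAGES_ONLY)
```

With these rules, should_index("image/png", rules) accepts the document, while should_index("text/html; charset=UTF-8", rules) rejects it, so HTML pages are still crawled and parsed for links but never reach the index.
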
> > 
> > The work is almost done and I’m writing a few tests to keep things 
> > consistent and organized. My question is whether anyone would be interested 
> > in having such a plugin shared. The plugin is really dead simple, and the 
> > same could very well be accomplished by anyone.
> > 
> > Regards,
> > 
> > VII International Summer School at UCI, June 30 to July 11, 2014. 
> > See www.uci.cu
> > 
> 
> 
