Hello Sachin,

Once any URL filter plugin rejects a URL, it is rejected entirely. Only URLs
that pass all active filter plugins are kept.
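
As a rough illustration only (this is not Nutch code, and the names apply_filters, reject_queries and allow_example_host are made up for the example), chaining filters behaves roughly like this: each filter returns the URL or nothing, and the first rejection ends the chain.

```python
from typing import Callable, Iterable, Optional

# A filter returns the (possibly unchanged) URL to keep it, or None to reject it.
UrlFilter = Callable[[str], Optional[str]]


def apply_filters(url: str, filters: Iterable[UrlFilter]) -> Optional[str]:
    """Run the URL through every filter; stop at the first rejection."""
    for url_filter in filters:
        result = url_filter(url)
        if result is None:
            return None  # rejected by one filter means rejected overall
        url = result
    return url


# Two hypothetical filters standing in for real plugins.
def reject_queries(url: str) -> Optional[str]:
    return None if "?" in url else url


def allow_example_host(url: str) -> Optional[str]:
    return url if url.startswith("https://example.org/") else None


chain = [reject_queries, allow_example_host]
print(apply_filters("https://example.org/page", chain))
print(apply_filters("https://example.org/page?x=1", chain))
```

The second URL is dropped even though allow_example_host alone would accept it.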

If you want specific queries to pass the regex-urlfilter, you must let them pass
explicitly above the -[?*!@=] line, e.g. +passThisQuery=
The rules are checked from top to bottom and the first matching rule decides.
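
To show why the order matters, here is a small Python sketch of first-match-wins rule evaluation. It is a simplified model based on my understanding of how regex-urlfilter behaves, not the plugin itself. The helper names parse_rules and accepts, the example.org URLs and the trailing "+." catch-all rule are assumptions for illustration.

```python
import re
from typing import List, Tuple

Rule = Tuple[bool, re.Pattern]  # (accept?, compiled pattern)


def parse_rules(text: str) -> List[Rule]:
    """Parse lines of the form '+regex' or '-regex', skipping comments and blanks."""
    rules: List[Rule] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url: str, rules: List[Rule]) -> bool:
    """The first rule whose pattern is found anywhere in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed here: no matching rule means the URL is rejected


rules = parse_rules("""
# let this one query through, then skip other probable queries
+passThisQuery=
-[?*!@=]
+.
""")

print(accepts("https://example.org/list?passThisQuery=1", rules))  # True
print(accepts("https://example.org/list?other=1", rules))          # False
print(accepts("https://example.org/page", rules))                  # True
```

If the +passThisQuery= line were placed below -[?*!@=], the first URL would already be rejected by the character rule and the + rule would never be reached.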

Use bin/nutch filterchecker -stdIn for quick testing.

Regards,
Markus

-----Original message-----
> From:Sachin Mittal <sjmit...@gmail.com>
> Sent: Monday 21st October 2019 14:22
> To: user@nutch.apache.org
> Subject: Adding specfic query parameters to nutch url filters
> 
> Hi,
> I have checked the regex-urlfilter and by default I see this line:
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> In my case for a particular url I want to crawl a specific query, so wanted
> to know what file would be the best to make changes to enable this.
> 
> Would it be regex-urlfilter or I also see a filters file suffix-urlfilter
> and fast-urlfilter.
> 
> Would adding filters in any of the later two files would help.
> Any idea why these filters are added, like what would be the potential
> usecase.
> 
> Also say if I add multiple filter plugins backed by these files, then how
> url filtering works? Only those urls which pass all the plugins are
> selected to be fetched or any of the plugin?
> 
> Thanks
> Sachin
> 
