RE: Adding specific query parameters to nutch url filters

2019-10-21 Thread Markus Jelsma
Hello Sachin,

Once a URL gets filtered, by any plugin, it is rejected entirely.

If you want specific queries to pass the regex-urlfilter, you must let them
pass explicitly with a rule above the -[?*!@=] line, e.g. +passThisQuery=
The first rule that matches a URL decides, so the order of the lines matters.
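
To make the two points above concrete, here is a rough Python sketch of the
behaviour, not Nutch code. It assumes the first matching rule in
regex-urlfilter decides, that an unmatched URL is rejected, and that a URL
must survive every filter in the chain. The rule list, the function names and
the example.com URLs are made up for illustration; only passThisQuery= and
-[?*!@=] come from this thread.

```python
import re

# Hypothetical rule list standing in for regex-urlfilter.txt.
# Order matters: the explicit allow rule sits above the generic skip rule.
RULES = [
    ("+", r"passThisQuery="),  # let this specific query through
    ("-", r"[?*!@=]"),         # default: skip probable queries
    ("+", r"."),               # accept anything else
]


def regex_filter(url, rules=RULES):
    """Return the URL if the first matching rule is '+', otherwise None."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return url if sign == "+" else None
    return None  # assumption: no rule matched, so the URL is rejected


def run_chain(url, filters):
    """Apply every filter in turn; one rejection rejects the URL entirely."""
    for url_filter in filters:
        url = url_filter(url)
        if url is None:
            return None
    return url


if __name__ == "__main__":
    for candidate in (
        "http://example.com/page?passThisQuery=1",
        "http://example.com/page?other=1",
        "http://example.com/page",
    ):
        verdict = "accepted" if regex_filter(candidate) else "rejected"
        print(verdict, candidate)
```

Note that in this sketch a URL carrying passThisQuery= together with other
parameters is accepted as well, because the allow rule matches first.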

Use bin/nutch filterchecker -stdIn for quick testing.

Regards,
Markus

-Original message-
> From:Sachin Mittal 
> Sent: Monday 21st October 2019 14:22
> To: user@nutch.apache.org
> Subject: Adding specific query parameters to nutch url filters
> 
> Hi,
> I have checked the regex-urlfilter and by default I see this line:
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> In my case, for a particular URL I want to crawl a specific query, so I
> wanted to know which file would be the best place to make changes to
> enable this.
> 
> Would it be regex-urlfilter, or one of the other filter files I see,
> suffix-urlfilter and fast-urlfilter?
> 
> Would adding filters to either of the latter two files help?
> Any idea why these filters exist, i.e. what the potential use case
> would be?
> 
> Also, say I add multiple filter plugins backed by these files: how does
> URL filtering work then? Are only those URLs that pass all the plugins
> selected to be fetched, or is passing any one plugin enough?
> 
> Thanks
> Sachin
> 


Adding specific query parameters to nutch url filters

2019-10-21 Thread Sachin Mittal
Hi,
I have checked the regex-urlfilter and by default I see this line:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

In my case, for a particular URL I want to crawl a specific query, so I
wanted to know which file would be the best place to make changes to
enable this.

Would it be regex-urlfilter, or one of the other filter files I see,
suffix-urlfilter and fast-urlfilter?

Would adding filters to either of the latter two files help?
Any idea why these filters exist, i.e. what the potential use case
would be?

Also, say I add multiple filter plugins backed by these files: how does
URL filtering work then? Are only those URLs that pass all the plugins
selected to be fetched, or is passing any one plugin enough?

Thanks
Sachin