Hi,

The pattern should work. Of course, you need to make sure that
- there are no other patterns coming before it in regex-urlfilter.txt
  which cause the URL to be rejected
- no other active URL filters reject the URL
- the folder of the regex-urlfilter.txt you're editing is first on the
  class path. Usually, $NUTCH_HOME/conf/regex-urlfilter.txt is used
- (optionally) you may simplify the regex: the characters /_= have no
  special meaning and do not need to be escaped by \
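
To illustrate the first point, here is a rough shell sketch of how a
first-match-wins rule file behaves. The first_match helper is hypothetical
and not part of Nutch; it relies on grep -E, whose syntax differs from Java's
regex engine in places, so it is only an approximation. The -[?*!@=] rule is
assumed here to mirror the one shipped in the default regex-urlfilter.txt;
if it comes before the + rule, it would reject any URL with a query string.

```shell
# Hypothetical helper (not part of Nutch): print the first rule in a
# regex-urlfilter-style file that matches the URL. Uses grep -E, so it only
# approximates Java regex behaviour.
first_match() {
  rules_file=$1
  url=$2
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      [+-]?*) ;;          # a rule: sign followed by a pattern
      *) continue ;;      # skip blank lines, comments, anything else
    esac
    regex=${line#?}       # strip the leading + or -
    if printf '%s\n' "$url" | grep -Eq -- "$regex"; then
      printf '%s\n' "$line"
      return 0
    fi
  done < "$rules_file"
  echo "(no rule matched, URL is rejected)"
}

# Demo: a query-character rule placed before the + rule wins.
rules=$(mktemp)
printf '%s\n' \
  '# assumed default-style rule, placed first' \
  '-[?*!@=]' \
  '+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])' \
  '-.' > "$rules"
first_match "$rules" \
  "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G"
rm -f "$rules"
```

The demo prints the -[?*!@=] rule, because the URL contains ? and = and that
rule is evaluated first. Moving the + rule above it (or removing the earlier
rule) lets the URL through.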

The easiest way to test it (Nutch 1.15):
% cat $NUTCH_HOME/conf/regex-urlfilter.txt
+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])
-.
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G)" \
   | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
+https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G)


And with another "forbidden" URL:
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X)" \
   | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
-https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X)
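
To check several URLs in one go, something like the following sketch may
help. It assumes that -stdin reads one URL per line, which has not been
verified here beyond the single-URL runs above. The candidate_urls function
is a hypothetical helper, not a Nutch command.

```shell
# Hypothetical helper: print one test URL per category letter given.
candidate_urls() {
  base="https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm"
  for cat in "$@"; do
    printf '%s?cat=%s\n' "$base" "$cat"
  done
}

# Print the URLs that would be fed to the checker.
candidate_urls G L X

# With Nutch on the PATH, pipe them into the same command as above:
#   candidate_urls G L X | nutch filterchecker -filterName urlfilter-regex -stdin
```

With the two-rule file shown earlier, the G and L URLs should come back
prefixed with + and the X URL with -.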


Best,
Sebastian

On 12/1/18 2:45 PM, Venkata MR wrote:
> Hi Nutch Users,
> 
> I was trying to crawl the site
> (https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G,
> https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L)
> with the filter pattern
> "+^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])",
> but it is rejecting the URLs.
> 
> I tried multiple options, but in all cases the URLs are rejected.
> 
> Any help here is appreciated, Thanks!
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
